- Overview
- Statistics and Machine Learning
- The Impact of Big Data
- Supervised and Unsupervised Learning
- Linear Models and Linear Regression
- Generalized Linear Models
- Generalized Additive Models
- Logistic Regression
- Enhanced Regression
- Survival Analysis
- Decision Tree Learning
- Bayesian Methods
- Neural Networks and Deep Learning
- Support Vector Machines
- Ensemble Learning
- Automated Learning
- Summary
Survival Analysis
For some business applications, the response measure you want to predict is the elapsed time to an event. This can literally be a lifetime if you model human mortality for life insurance; or it can be the time to failure for a device, the time to attrition for a customer account, or any similar situation in which you want to predict survival.
Time-to-event measures pose unique problems for the analyst. Suppose that you want to predict the survival time for patients receiving an experimental cancer treatment. After three years, some of the patients in the study have died, and you can compute the survival time for each of these patients. However, many of the patients are still living at the end of three years; you do not yet know their ultimate survival time. Statisticians call this problem censoring; it surfaces whenever you model a time-to-event response measure using data captured over a limited time period.
The two kinds of censoring are right censoring and left censoring. If you know only that the pertinent event occurred after some date, as is the case for patients in the preceding example who survive to the end of the study, the data is right-censored. If, on the other hand, you know only that the event of interest began before a certain date, the data is left-censored. For example, if you know that every patient in the study received the experimental treatment before the study started but do not know the exact treatment dates, the data is left-censored. A data set can contain both right-censored and left-censored observations.
Survival analysis is a family of techniques developed to work with censored time-to-event response measures. Note that if censoring is not present, you may be able to model time-to-event using standard modeling techniques. For some studies, however, you would have to wait a very long time before every sampled observation has a terminal event; in the case of the experimental cancer treatment, some patients might live another 20 years. Hence, survival analysis techniques enable the analyst to take full advantage of available data without waiting until every treated patient dies, every sampled part fails, or every tracked account closes.
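One standard survival analysis tool that takes full advantage of censored data is the Kaplan-Meier estimator of the survival curve. The sketch below assumes a simple data layout introduced here for illustration: each observation is a (duration, event) pair, where event = 1 means the terminal event was observed and event = 0 means the record is right-censored.

```python
def kaplan_meier(observations):
    """Kaplan-Meier sketch: return [(time, survival_probability)] at each
    observed event time, given (duration, event) pairs."""
    obs = sorted(observations)
    n_at_risk = len(obs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(obs):
        t = obs[i][0]
        deaths = 0
        removed = 0
        # Group all observations that share this time point.
        while i < len(obs) and obs[i][0] == t:
            deaths += obs[i][1]
            removed += 1
            i += 1
        if deaths:
            # Survival drops only at event times.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Censored subjects leave the risk set without causing a drop.
        n_at_risk -= removed
    return curve

# Five patients: three died (at months 2, 5, and 8); two were still
# alive when follow-up ended (censored at months 6 and 10).
data = [(2, 1), (5, 1), (6, 0), (8, 1), (10, 0)]
print(kaplan_meier(data))
```

Note how the two censored patients still contribute information: they remain in the risk set up to their censoring times, which is exactly the "full advantage of available data" described above.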
In addition to the censoring problem described previously, time-to-event response measures generally follow an exponential or Weibull distribution rather than a normal distribution; consequently, linear regression tends to perform poorly. Three alternative techniques are used widely for this problem:
- Cox’s proportional hazards model
- Exponential regression
- Log-normal regression
Cox’s proportional hazards (CPH) model is a semiparametric method: it makes no assumptions about the distribution of the response measure. CPH models the underlying hazard rate (for example, the risk of death) as the product of a baseline hazard rate and a multiplicative effect contributed by the predictor variables. Exponential regression assumes that the time-to-event response measure follows an exponential distribution. In log-normal regression, the analyst replaces the raw survival response measure with its natural logarithm and then uses standard regression tools to model the transformed measure. Log-normal regression is the simplest technique to implement but may not perform as well as CPH or exponential regression.
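The two parametric approaches can be sketched briefly. These are illustrative sketches, not production implementations: the (duration, event) data layout and the dose predictor are assumptions introduced here, and a real analysis would use a dedicated survival package.

```python
import math

# --- Exponential regression (intercept-only) ------------------------
# Under an exponential model the hazard is a constant rate lambda.
# With right-censored data the maximum-likelihood estimate has a
# well-known closed form: observed events / total time at risk.
def exponential_rate(observations):
    events = sum(e for _, e in observations)
    exposure = sum(t for t, _ in observations)
    return events / exposure

data = [(2, 1), (5, 1), (6, 0), (8, 1), (10, 0)]  # (duration, event)
lam = exponential_rate(data)        # 3 events over 31 months at risk
median_survival = math.log(2) / lam # median implied by the model

# --- Naive log-normal regression ------------------------------------
# As described above: replace the raw survival time with its natural
# logarithm and fit ordinary least squares against a predictor.
# Dropping the censored records, as this sketch does, is one reason
# the naive approach may underperform CPH or exponential regression.
def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope   # intercept, slope

# Hypothetical predictor (e.g., dose) for the uncensored patients only.
dose = [1.0, 2.0, 3.0]
log_time = [math.log(t) for t, e in data if e == 1]
intercept, slope = ols(dose, log_time)
```

The exponential estimate uses every observation, censored or not, while the naive log-normal fit silently discards the censored records; that contrast is the practical argument for purpose-built survival methods.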
Popular statistical packages (such as SAS, SPSS, and Statistica) support all three methods, and open source R offers many packages for survival analysis.