- Overview
- Statistics and Machine Learning
- The Impact of Big Data
- Supervised and Unsupervised Learning
- Linear Models and Linear Regression
- Generalized Linear Models
- Generalized Additive Models
- Logistic Regression
- Enhanced Regression
- Survival Analysis
- Decision Tree Learning
- Bayesian Methods
- Neural Networks and Deep Learning
- Support Vector Machines
- Ensemble Learning
- Automated Learning
- Summary
Logistic Regression
Linear regression is powerful and widely used. In real-world applications, however, analysts often seek to model categorical behavior:
- Prospects either respond or do not respond to a marketing communication.
- Borrowers repay a loan or do not repay a loan.
- Shoppers choose Brand X over Brand Y and Brand Z.
It is often possible to model this behavior with linear regression by coding the response measure as 1 (if the prospect responds) and 0 (if the prospect does not respond), but another technique, logistic regression, produces better and more useful results. Statisticians developed logistic regression specifically to model the relationship between a categorical response measure and one or more predictor variables. As with linear regression, the predictors are ordinarily continuous, but analysts work around this requirement by dummy coding categorical predictors, as in the sketch below.
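As a minimal sketch of dummy coding, the snippet below expands a categorical predictor into 0/1 indicator columns before fitting a logistic regression. It assumes pandas and scikit-learn; the column names and values are hypothetical illustrations, not data from any particular study.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical marketing data: a continuous predictor, a categorical
# predictor, and a binary response (1 = responded, 0 = did not).
df = pd.DataFrame({
    "income": [42_000, 85_000, 31_000, 60_000, 95_000, 27_000],
    "region": ["east", "west", "east", "south", "west", "south"],
    "responded": [0, 1, 0, 1, 1, 0],
})

# Dummy coding: expand the categorical predictor into 0/1 indicator
# columns, dropping one level to avoid perfect collinearity.
X = pd.get_dummies(df[["income", "region"]], columns=["region"], drop_first=True)
y = df["responded"]

model = LogisticRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_[0])))
```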
Analysts use logistic regression to address three types of classification problems. The first is binomial classification, in which the response measure has only two levels: a prospect either responds or does not respond. The second is multinomial ordinal classification, in which the response measure can take more than two values with an implied rank ordering: surveyed customers report that they are “very satisfied,” “somewhat satisfied,” “somewhat dissatisfied,” or “very dissatisfied.” The third is multinomial nominal classification, in which the response measure can take more than two values with no implied rank ordering: surveyed customers choose among “Chevrolet,” “Ford,” “Honda,” and “Toyota.”
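The sketch below contrasts the binomial and multinomial nominal cases using scikit-learn on synthetic stand-in data; ordinal responses call for a specialized model (for example, statsmodels' OrderedModel) and are omitted here for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # three synthetic predictors

# Binomial: two-level response (respond / not respond).
y_binary = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
binom = LogisticRegression().fit(X, y_binary)

# Multinomial nominal: more than two unordered levels (brand choice),
# generated here as the highest-scoring of three synthetic utilities.
y_brand = np.argmax(X @ rng.normal(size=(3, 3)), axis=1)
multinom = LogisticRegression().fit(X, y_brand)

print(binom.predict_proba(X[:2]))     # one probability column per class
print(multinom.predict_proba(X[:2]))  # three columns, one per brand
```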
Logistic regression produces estimates of the model intercept and coefficients, together with quality statistics for the individual parameters and for the model as a whole. Applied to new data, the fitted model produces a probability between zero and one, reflecting the likelihood that the case belongs to the target class given the known values of the predictor variables. For decision making, the analyst combines this predicted probability with a cutoff rule to classify each new case, as sketched below.
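The following sketch, again assuming scikit-learn and synthetic data, scores new cases and applies a cutoff rule; the 0.5 threshold is an illustrative choice, and in practice the analyst picks a cutoff that balances the costs of false positives and false negatives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 2))
y_train = (X_train[:, 0] - X_train[:, 1] + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Score new cases: predicted probability of the target class.
X_new = rng.normal(size=(5, 2))
prob = model.predict_proba(X_new)[:, 1]

# Cutoff rule: classify as the target class when the probability
# meets or exceeds the chosen threshold.
cutoff = 0.5
decision = (prob >= cutoff).astype(int)
print(np.column_stack([prob.round(3), decision]))
```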
The most widely used method to estimate logistic regression models is maximum likelihood. Maximum likelihood estimation is iterative: the algorithm assigns initial values to the model coefficients, evaluates that solution against the training data, improves the model, and repeats, improving and evaluating until it can find no further meaningful improvement. Software implementations of logistic regression generally let the analyst specify the model quality measure, the convergence threshold for improvements, and the maximum number of iterations; the sketch below illustrates the loop.
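To make the loop concrete, here is a NumPy sketch of maximum likelihood estimation for logistic regression via Newton-Raphson iterations. The tolerance and iteration cap stand in for the controls that packaged implementations expose; this is an illustration under those assumptions, not production code.

```python
import numpy as np

def fit_logistic(X, y, tol=1e-8, max_iter=25):
    beta = np.zeros(X.shape[1])                    # initial coefficient values
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # current fitted probabilities
        grad = X.T @ (y - p)                       # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian
        step = np.linalg.solve(hess, grad)         # Newton improvement step
        beta += step
        if np.max(np.abs(step)) < tol:             # no meaningful improvement left
            break
    return beta

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])  # intercept + 2 predictors
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.uniform(size=300) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)
print(fit_logistic(X, y))  # estimates should land near true_beta
```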
In some cases, the maximum likelihood algorithm reaches the maximum number of iterations before converging on a meaningful solution. This can happen when predictors are highly correlated, when the data are sparse, or when the number of predictors is very large relative to the number of cases. Analysts address correlated predictors with dimension-reduction techniques, applied to the data before running logistic regression, as sketched below. There are also techniques for sparse and high-dimensional data; we discuss each separately.
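As one illustration of the dimension-reduction remedy, the sketch below chains PCA with logistic regression in a scikit-learn pipeline; the number of components and the synthetic correlated data are assumptions chosen for the example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
base = rng.normal(size=(400, 3))
# Build 12 predictors that are noisy copies of 3 underlying factors,
# so the raw columns are highly correlated with one another.
X = np.hstack([base + 0.05 * rng.normal(size=(400, 3)) for _ in range(4)])
y = (base[:, 0] > 0).astype(int)

model = make_pipeline(
    StandardScaler(),        # put predictors on a common scale first
    PCA(n_components=3),     # collapse correlated columns into 3 components
    LogisticRegression(),
)
model.fit(X, y)
print(model.score(X, y))     # in-sample accuracy of the reduced model
```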
Almost all commercial statistical packages offer an implementation of logistic regression. The method is also widely available in open source software; open source R alone offers more than 50 implementations.