- Overview
- Statistics and Machine Learning
- The Impact of Big Data
- Supervised and Unsupervised Learning
- Linear Models and Linear Regression
- Generalized Linear Models
- Generalized Additive Models
- Logistic Regression
- Enhanced Regression
- Survival Analysis
- Decision Tree Learning
- Bayesian Methods
- Neural Networks and Deep Learning
- Support Vector Machines
- Ensemble Learning
- Automated Learning
- Summary
Linear Models and Linear Regression
Linear models and linear regression techniques are the most fundamental methods available to the analyst for predictive modeling; we review these methods next.
Basics: Linear Models
A mathematical model is an expression that describes the relationship between two or more measures. Businesses use models in many ways—pricing is a familiar example. If the price of one widget is five dollars, the price of many widgets is y = 5*x, where y is the total price quoted and x is the number of widgets bought. If you express pricing as a mathematical model, you can build the model formula into point-of-sale devices, online quote systems, and a host of other useful applications. (Of course, because organizations set prices for their products, you don’t need a statistician to discover the pricing model; you can simply call the Pricing department. We’re just using pricing as an everyday example.)
A linear model is a mathematical model in which the relationship between the independent variable and the dependent variable is the same for all values of the independent variable. In other words, if y = 2x holds when x = 2, the same formula holds when x = 4, x = 4,000,000, or any other value.
A linear model can also include a constant. Suppose that the pricing includes a shipping and handling fee of 50 dollars; now, the pricing model is y = 50 + 5*x. It is easy to visualize a linear model with a single variable and a constant (see Exhibit 9.3).
Exhibit 9.3 Linear Model with One Variable and a Constant
A linear model can include more than one predictor as long as the predictors are additive. For example, if the price of a gadget is two dollars, the total price of an order is y = 50 + 5*x1 + 2*x2, where x1 is the number of widgets and x2 is the number of gadgets. You can extend this model to include any number of items as long as the total quote is simply the sum of the quotes for the individual items plus a constant.
Generalizing from the pricing example, a linear model is one that you can express as y = b + a1x1 + a2x2 + ... + anxn, where y is the response measure and x1...xn are the predictors. Statisticians call the remaining values in the equation parameters; they include the value of b, a constant, and the values a1 through an, called coefficients. The coefficients represent the relationship between the predictors and the response measure; when there is a single predictor, this is the slope of a line representing the function.
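As a minimal sketch in Python, the generalized pricing model above can be written as an ordinary function; the function name, parameter names, and prices are illustrative, not part of any package.

```python
def total_quote(widgets, gadgets, base_fee=50.0, widget_price=5.0, gadget_price=2.0):
    """Linear model y = b + a1*x1 + a2*x2 with illustrative parameter values."""
    return base_fee + widget_price * widgets + gadget_price * gadgets

print(total_quote(widgets=10, gadgets=3))  # 50 + 5*10 + 2*3 = 106.0
```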
If you want to use a linear model for prediction, you need to know the values of its parameters. In the pricing example, this is trivial, because the business decides the parameters for the pricing model. If you want to use a linear model to predict something complex and unknown—such as the future payment behavior of credit card customers—you need to estimate the value of model parameters. You could simply guess at the values of the parameters, but if you want to have some confidence in your predictions, you will use a statistical technique called linear regression to estimate the parameters from historical data.
To summarize, linear models are one kind of mathematical model with properties that make them easy to interpret and deploy. Linear regression is one of the techniques statisticians use to estimate the parameters of a linear model. The linear model is the result of analysis; linear regression is a tool used to accomplish this end.
Basics: Linear Regression
When you do not know the parameters of a hypothetical linear model in advance, linear regression is the method you use to estimate those parameters. Linear regression scans the data and computes the parameters of the linear model that best fits the data. The method selects the best-fitting model using the least squares criterion, which minimizes the sum of squared differences between predicted and actual values.
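As a concrete illustration of the least squares criterion, the sketch below fits a single-predictor linear model with NumPy; the data are synthetic and the variable names are our own.

```python
import numpy as np

# Synthetic data roughly following y = 50 + 5*x, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 50 + 5 * x + rng.normal(0, 10, size=200)

# Ordinary least squares: solve for [b, a] minimizing the sum of squared errors
X = np.column_stack([np.ones_like(x), x])        # design matrix with an intercept column
params, *_ = np.linalg.lstsq(X, y, rcond=None)
b, a = params
print(f"estimated intercept={b:.2f}, slope={a:.2f}")  # close to 50 and 5
```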
Suppose that you are interested in predicting the total crop yield of small farms, and you believe that the number of acres in production is the single most important predictor of total yield. (The farms are all in the same general area and use similar practices.) When you plot total yield against acres in production for a sample of 100 farms, you get a graph like the one shown in Exhibit 9.4. The dashed line is the linear regression line.
Exhibit 9.4 Linear Regression
Linear regression is a powerful and widely used method that is pervasive in statistical packages and relatively easy to implement. However, the method has a number of properties that limit its application, require the analyst to prepare the data in certain ways or, in the worst case, lead to spurious results.
Among the limiting factors, the most important is an assumption that the response measure is a continuous numeric variable. Although it is possible to fit a regression model to a categorical response measure, the results are likely to be inferior to what the analyst could achieve using methods designed for categorical response measures, which we discuss in a later section.
Two characteristics of regression require the analyst to take additional steps to prepare the data. Like most statistical methods, regression requires that all fields specified in the model have a value, and will remove records with missing values from the analysis. Regression also requires continuous numeric predictors. Analysts can work around the missing data problem through exhaustive quality control when gathering data, or by imputing values for missing fields. Analysts can also handle categorical variables in linear regression through a method called dummy coding. Statistical software packages vary widely in the degree to which they automate these tasks for the analyst.
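As an illustration of these two preparation steps, the following sketch uses pandas for mean imputation and dummy coding; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data with a missing value and a categorical predictor
df = pd.DataFrame({
    "yield_tons": [120.0, 95.0, 140.0, 110.0],
    "acres":      [40.0, None, 55.0, 42.0],
    "soil_type":  ["loam", "clay", "loam", "sand"],
})

# Impute the missing numeric value with the column mean
df["acres"] = df["acres"].fillna(df["acres"].mean())

# Dummy-code the categorical predictor (drop one level to avoid redundancy)
df = pd.get_dummies(df, columns=["soil_type"], drop_first=True)
print(df)
```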
Analysts are most concerned with those characteristics of linear regression that produce an inferior or spurious model. For example, linear regression presumes that a linear model is the appropriate theoretical model to represent the behavior you seek to analyze. The point is important because the regression algorithm does not know the true theoretical model and will attempt to estimate model parameters from data regardless of the true state of affairs. Exhibit 9.5 shows an example of a spurious relationship.
Exhibit 9.5 Chart Showing Spurious Regression Line
The analyst detects a weak model by inspecting model diagnostics. However, it is theoretically possible for a regression model to identify a statistically significant relationship between two variables when no causal relationship exists between them in the real world.
For each model specification, linear regression packages report a key statistic called the coefficient of determination, or R-squared. This statistic measures how well the model fits the data; conceptually, it is the variation in the response measure explained by the model, expressed as a percentage of the total variation in the response measure. Analysts use this measure together with its associated F-test to judge the quality of the model. If the R-squared is low, the analyst will look for ways to improve the model, either by adding more predictors or by using a different method.
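Conceptually, R-squared can be computed directly from a model's predictions; the sketch below shows one way to do so (the helper function is ours, not part of a regression package).

```python
import numpy as np

def r_squared(y_actual, y_predicted):
    """R-squared: share of total variation in y explained by the model."""
    y_actual = np.asarray(y_actual)
    y_predicted = np.asarray(y_predicted)
    ss_res = np.sum((y_actual - y_predicted) ** 2)       # unexplained variation
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # total variation
    return 1 - ss_res / ss_tot
```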
Analysts also examine the significance tests for each model coefficient. If a coefficient fails a significance test, the implication is that its true value may be zero and that the associated predictor does not contribute meaningfully to the model. Good modeling practice calls for dropping such a predictor from the model specification and re-estimating the model.
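Most regression packages report these diagnostics in a single summary. As one example, the sketch below uses the statsmodels library (assuming it is installed) to fit an ordinary least squares model and print the R-squared, F-test, and per-coefficient significance tests; the data are synthetic.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)             # x2 is unrelated to y
y = 3 + 2 * x1 + rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()
print(model.summary())                # reports R-squared, F-test, and per-coefficient p-values
```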
If two or more predictors are highly correlated, estimated values of the coefficients can be highly unstable. This condition, known as multicollinearity, does not impair the overall ability of the model to predict, but it renders the model less useful for explanatory analysis.
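A common way to check for multicollinearity is the variance inflation factor (VIF); the sketch below computes it with statsmodels on deliberately correlated synthetic predictors.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)   # x2 is nearly identical to x1
X = np.column_stack([np.ones(100), x1, x2])  # design matrix with an intercept column

# A VIF well above ~10 for a predictor is a common rule of thumb for multicollinearity
for i, name in enumerate(["const", "x1", "x2"]):
    print(name, variance_inflation_factor(X, i))
```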
Advantages and Disadvantages
The principal advantages of linear regression are its simplicity, interpretability, scientific acceptance, and widespread availability. Linear regression is often the first method to try for many problems. Analysts can use linear regression together with techniques such as variable recoding, transformation, or segmentation.
Its principal disadvantage is that many real-world phenomena simply do not correspond to the assumptions of a linear model; in these cases, it is difficult or impossible to produce useful results with linear regression.
Linear regression is widely available in statistical software packages and business intelligence tools.