- Overview
- Statistics and Machine Learning
- The Impact of Big Data
- Supervised and Unsupervised Learning
- Linear Models and Linear Regression
- Generalized Linear Models
- Generalized Additive Models
- Logistic Regression
- Enhanced Regression
- Survival Analysis
- Decision Tree Learning
- Bayesian Methods
- Neural Networks and Deep Learning
- Support Vector Machines
- Ensemble Learning
- Automated Learning
- Summary
Statistics and Machine Learning
There are two classes of techniques for predictive analytics with very different legacies: statistical methods and machine learning.
Statistical methods, such as linear regression, estimate the parameters of mathematical models with known properties; the analyst seeks to test the hypothesis that the behavior of interest conforms to a specific class of mathematical model. The advantage of these models is that they are highly generalizable. If you can demonstrate that historical data conforms to a known distribution, you can use this information to predict behavior for new cases.
For example, if you know the position, velocity, and acceleration of an artillery shell, you can predict where it will land because you can use a mathematical model to compute the point of impact. By analogy, if you can show that response to a marketing campaign follows a known statistical distribution, you can predict response with a degree of confidence based on information about the customer’s past purchases, demographics, characteristics of the offer, and so forth.
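The statistical approach can be made concrete with a small sketch. The snippet below fits a simple linear model y = a + b·x by ordinary least squares, using the closed-form estimates that follow from the model's known properties, and then uses the fitted parameters to score a new case. The data values (campaign spend versus response) are invented purely for illustration.

```python
# Minimal sketch of the statistical approach: estimate the parameters of
# a known model class (a line), then predict new cases from the fit.

def fit_ols(xs, ys):
    """Return (intercept, slope) minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares estimates for a simple linear model.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical historical data: campaign spend (xs) vs. response (ys).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_ols(xs, ys)
predicted = a + b * 6.0  # predict response for a new, unseen case
```

Because the model's form is known in advance, the fitted parameters generalize to any new case drawn from the same process, which is exactly the advantage described above.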
The principal disadvantage of statistical methods is that real-world phenomena frequently do not conform to known statistical distributions.
Machine learning techniques differ fundamentally from statistical techniques because they do not start from a particular hypothesis about behavior; instead, they seek to learn and describe the relationship between historical facts and target behavior as closely as possible. Because machine learning techniques are not constrained by specific statistical distributions, they are often able to build models that are more accurate.
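The contrast with the statistical approach can be sketched with one of the simplest machine learning models, a decision stump: it learns a split threshold directly from labeled examples, with no assumption that the data follow any known distribution. The data here (past purchases versus response) are invented for illustration.

```python
# Minimal sketch of the machine-learning approach: search the training
# data for the relationship itself, rather than fitting an assumed
# distribution.

def learn_stump(xs, labels):
    """Find the threshold on x that best separates the two classes."""
    best_threshold, best_errors = None, len(labels) + 1
    for candidate in sorted(set(xs)):
        # Predict 1 when x >= candidate, else 0, and count mistakes.
        errors = sum((x >= candidate) != bool(label)
                     for x, label in zip(xs, labels))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold

# Hypothetical data: past purchases (xs), responded to offer (1) or not (0).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = learn_stump(xs, labels)  # learned decision boundary
```

Nothing in the procedure assumes the purchases follow a normal (or any other) distribution; the boundary is whatever the training data support.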
However, machine learning techniques can overlearn (overfit): that is, they learn relationships in the training data that do not generalize to the population. Consequently, most widely used machine learning techniques have built-in mechanisms to control overlearning, such as cross-validation or pruning on an independent sample.
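One of those safeguards, k-fold cross-validation, can be sketched in a few lines: the model is repeatedly fit on part of the data and scored on the held-out part, so relationships that do not generalize show up as poor held-out scores. The "model" below is deliberately trivial (predict the training mean) and the data are invented; the point is the train/test discipline, not the model.

```python
# Minimal sketch of k-fold cross-validation as a control on overlearning.

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k folds over n examples."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def cross_validated_error(ys, k=4):
    """Mean squared error of a mean-predictor, estimated out-of-sample."""
    total, count = 0.0, 0
    for train, test in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train) / len(train)  # "fit"
        total += sum((ys[i] - prediction) ** 2 for i in test)  # score held-out
        count += len(test)
    return total / count

ys = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 4.0, 6.0]  # hypothetical outcomes
cv_error = cross_validated_error(ys)
```

Because every example is scored only by a model that never saw it during fitting, the resulting error estimate reflects performance on the population rather than on memorized training cases.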
The distinction between statistics and machine learning is narrowing as the two fields converge; stepwise regression, for example, is a hybrid method that draws on both traditions.
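The hybrid character of stepwise regression can be sketched as follows: a statistical model (least squares) is combined with a machine-learning-style greedy search over which features to include. This is a simplified, no-intercept variant that regresses the running residuals on each remaining feature and keeps the feature that reduces squared error most; the feature names and data are invented for illustration.

```python
# Simplified sketch of forward stepwise selection: least-squares fitting
# (statistics) driven by a greedy search over features (machine learning).

def simple_fit(xs, ys):
    """Slope of a no-intercept least-squares fit of ys on xs."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def forward_stepwise(features, ys, n_steps=2):
    """Greedily select features; return their names in selection order."""
    residuals = list(ys)
    remaining = dict(features)
    selected = []
    for _ in range(n_steps):
        best_name, best_sse = None, sum(r * r for r in residuals)
        for name, xs in remaining.items():
            slope = simple_fit(xs, residuals)
            sse = sum((r - slope * x) ** 2 for x, r in zip(xs, residuals))
            if sse < best_sse:
                best_name, best_sse = name, sse
        if best_name is None:
            break  # no remaining feature improves the fit
        xs = remaining.pop(best_name)
        slope = simple_fit(xs, residuals)
        residuals = [r - slope * x for x, r in zip(xs, residuals)]
        selected.append(best_name)
    return selected

# Hypothetical predictors of response; ys is driven by "recency" alone.
features = {
    "recency":   [1.0, 2.0, 3.0, 4.0],
    "frequency": [1.0, -1.0, 1.0, -1.0],
    "noise":     [0.1, -0.1, 0.1, -0.1],
}
ys = [2.0, 4.0, 6.0, 8.0]

order = forward_stepwise(features, ys)
```

The search stops as soon as no candidate feature improves the fit, so irrelevant predictors such as the hypothetical "noise" column are never admitted to the model.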