- Overview
- Statistics and Machine Learning
- The Impact of Big Data
- Supervised and Unsupervised Learning
- Linear Models and Linear Regression
- Generalized Linear Models
- Generalized Additive Models
- Logistic Regression
- Enhanced Regression
- Survival Analysis
- Decision Tree Learning
- Bayesian Methods
- Neural Networks and Deep Learning
- Support Vector Machines
- Ensemble Learning
- Automated Learning
- Summary
The Impact of Big Data
By “Big Data,” we mean data sets that are “big” on any one of three dimensions: volume, variety, and velocity. One of the premises of this book is that Big Data technology has already changed the analytics landscape and that a new approach is needed—what we call “Modern Analytics.”
How big is “Big”? For data management, data is Big Data if it is too large to fit efficiently in a relational database. For analytics, we use a different definition; data qualifies as Big Data if it meets any one of three conditions:
- The analytic data set is too large to fit into memory on a single machine.
- The analytic data set is too large to move to a dedicated analysis platform.
- Source data for analysis resides in a Big Data repository, such as Hadoop, an MPP database, NoSQL database, or NewSQL database.
Data volume can mean two different things with different implications for the analyst. When the analyst works with structured data in matrices or tables, “volume” can mean more rows, more columns, or both. Analysts routinely work with data sets containing millions or billions of rows by sampling records at random and then using the sample to train and validate predictive models. Sampling works reasonably well when the goal is to build a single predictive model for the entire population and the incidence of modeled behavior is relatively high and uniform in the population. With modern analytics technology, however, sampling is an option and not a requirement forced on the analyst by limited computing resources.
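As a minimal sketch of that workflow in Python with pandas and scikit-learn (the data set, column names, sampling fraction, and split proportions are all illustrative assumptions, not taken from the text), the snippet below draws a random sample from a large table and splits it into training and validation partitions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Simulate a large analytic data set (a stand-in for millions of customer records).
rng = np.random.default_rng(42)
n = 1_000_000
full = pd.DataFrame({
    "tenure": rng.integers(1, 120, size=n),
    "monthly_spend": rng.gamma(2.0, 50.0, size=n),
    "responded": rng.binomial(1, 0.05, size=n),  # relatively low-incidence behavior
})

# Draw a 1 percent simple random sample of the rows.
sample = full.sample(frac=0.01, random_state=42)

# Split the sample into training and validation partitions for modeling.
train, validate = train_test_split(
    sample, test_size=0.3, random_state=42, stratify=sample["responded"]
)
print(len(train), len(validate))
```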
Adding more columns to the analytic data set affects the analyst in a very different way. The most effective way to improve the performance of a predictive model is to add new variables with information value; however, the analyst cannot always know in advance which variables will add value to a model. This means that as variables are added to the analytic data set, the analyst needs tooling that makes it possible to scan across many variables quickly and find those that add value to a predictive model.
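One common way to provide that kind of scan is a univariate screen; the sketch below (an illustration only, using scikit-learn’s mutual_info_classif on synthetic data with hypothetical column names) ranks candidate columns by their estimated association with the response:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: 20 candidate predictors, only two of which carry real signal.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(5000, 20)),
    columns=[f"var_{i:02d}" for i in range(20)],
)
signal = X["var_03"] + 0.5 * X["var_07"] + rng.normal(scale=0.5, size=5000)
y = (signal > 0).astype(int)

# Score every candidate column against the response and rank the results.
scores = mutual_info_classif(X, y, random_state=0)
ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(ranking.head())
```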
Having many columns or variables also means that there are many possible ways to specify a predictive model. To illustrate this point, consider the simple example of an analytic data set with one response measure and five predictors, a tiny data set by any measure. There are 31 unique combinations of the five predictors as main effects alone (every non-empty subset of the five), and many more possible model specifications if you consider interaction effects and various transformations of the predictors. The number of possible model specifications explodes as the number of variables increases; this places a premium on methods and techniques that enable the analyst to search efficiently for the best model.
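The arithmetic is easy to verify: each of the five predictors is either in or out of the model, giving 2^5 − 1 = 31 non-empty subsets before any interactions or transformations are considered. The short Python enumeration below confirms the count:

```python
from itertools import combinations

predictors = ["x1", "x2", "x3", "x4", "x5"]

# Enumerate every non-empty subset of the predictors (main effects only).
subsets = [
    combo
    for k in range(1, len(predictors) + 1)
    for combo in combinations(predictors, k)
]
print(len(subsets))  # 31, i.e. 2**5 - 1
```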
“Variety” means working with data that is not structured in matrix or table form. In itself, this is not new; analysts have worked with data in many different formats for years, and text mining is a mature field. The most important change introduced by the Big Data trend is the large-scale adoption of unstructured formats for analytic data stores and the growing recognition that unstructured data—web logs, medical provider notes, social media comments, and so on—offers significant value for predictive modeling. This means that analysts must consider unstructured data sources when planning projects and build the necessary tooling into enterprise analytics architecture.
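As a minimal illustration of the point (the comments and the tooling choice are hypothetical, not drawn from the text), the sketch below uses scikit-learn’s CountVectorizer to turn free text into a numeric matrix that standard modeling tools can consume:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up customer comments; in practice these might sit in a Big Data repository.
comments = [
    "love the new product, works great",
    "terrible support, still waiting for a refund",
    "great price and fast shipping",
]

# Convert free text into a document-term matrix that modeling tools can consume.
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(comments)

print(X_text.shape)                        # (3 documents, number of distinct tokens)
print(vectorizer.get_feature_names_out())  # the tokens behind the columns
```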
“Velocity,” the third V of Big Data, affects predictive analytics in two ways: as a source of data for analysis and as a target for real-time scoring. Analysts working with streaming data, such as telemetry from a racing car or live feeds from monitoring equipment in a hospital intensive care unit, must use special techniques to sample and window the data stream; these techniques convert the continuous stream into a discrete time series for analysis.
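A minimal sketch of the windowing idea, using a simulated telemetry stream and an arbitrary one-second tumbling window (both are illustrative assumptions):

```python
from statistics import mean

# Simulated stream of (timestamp_in_seconds, reading) pairs from a sensor.
stream = [
    (0.1, 98.2), (0.4, 98.5), (0.9, 98.4),
    (1.2, 99.0), (1.7, 98.8),
    (2.3, 99.5), (2.6, 99.7), (2.9, 99.4),
]

# Tumbling one-second windows: assign each reading to a window, then aggregate.
windows = {}
for ts, value in stream:
    windows.setdefault(int(ts), []).append(value)

# The result is a discrete time series: one aggregated value per window.
time_series = {w: round(mean(values), 2) for w, values in sorted(windows.items())}
print(time_series)  # {0: 98.37, 1: 98.9, 2: 99.53}
```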
When the analyst seeks to apply predictive analytics to streaming data, as in real-time scoring, most organizations will use a dedicated decision engine designed to deliver high performance when scoring individual transactions.
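For illustration only, the fragment below sketches the scoring step such an engine performs: a model trained offline is held in memory and each arriving transaction is scored individually (the model, features, and decision threshold are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a throwaway model; a real decision engine would load a serialized model instead.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(1000, 3))
y_train = (X_train[:, 0] - X_train[:, 2] + rng.normal(size=1000) > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

def score_transaction(features):
    """Score one incoming transaction and return an accept/review decision."""
    prob = model.predict_proba(np.asarray(features).reshape(1, -1))[0, 1]
    return "review" if prob > 0.5 else "accept"

# A single transaction arriving on the stream (feature values are illustrative).
print(score_transaction([0.8, -0.1, -1.2]))
```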