Introduction to Predictive Analytics and Data Mining
Data mining is one of the most important enablers of data analytics in general and of predictive analytics in particular. Although its roots date back to the late 1980s and early 1990s, the most powerful data mining applications have been developed since the turn of the 21st century. Many people believe that the recent popularity of analytics can largely be credited to the increasing use of data mining, which extracts much-needed insight and knowledge from data and provides them to decision makers at all levels of the managerial hierarchy. The term data mining was originally used to describe the process through which previously unknown patterns in data were discovered. Software vendors and consulting companies later expanded this definition to include most forms of data analysis, broadening its apparent reach and capability and thereby increasing sales of software tools and services related to data mining. With the emergence of analytics as an overarching term for all data analyses, data mining has been put back into its proper place: the point on the analytics continuum where new knowledge is discovered.
In an article in Harvard Business Review, Thomas Davenport (2006), a well-known and respected expert in the field of business analytics, argued that the latest strategic weapon for today’s businesses is analytical decision making that is based on discovery of new knowledge through data mining. He provided examples of companies such as Amazon.com, Capital One, Marriott International, and others that have used (and are still using) analytics to better understand their customers and optimize their extended supply chains in order to maximize their returns on investment while providing the best possible customer service. This level of success can happen only if the company pursues all avenues, including analytics at all three levels (descriptive, predictive, and prescriptive), to intimately understand its customers and their needs and wants, along with its vendors, business processes, and extended supply chain.
Data mining is the process of converting data into information and then into knowledge. In the context of knowledge management, data mining is the phase in which new knowledge is created. Knowledge is very distinct from data and information (see Figure 2.1). Whereas data is facts, measurements, and statistics, information is organized or processed data that is timely (i.e., inferences from the data are drawn within a particular time frame) and understandable (i.e., with regard to the original data). Knowledge is information that is contextual, relevant, and actionable. For example, a map that gives detailed driving directions from one location to another could be considered data. An up-to-the-minute traffic bulletin along the freeway that indicates a traffic slowdown due to construction several miles ahead could be considered information. Awareness of an alternative, back-roads route could be considered knowledge. In this case, the map is considered data because it does not contain current relevant information that affects the driving time and conditions from one location to the other. However, having the current conditions as information is useful only if you have knowledge that enables you to avoid the construction zone. The implication is that knowledge has strong experiential and reflective elements that distinguish it from information in a given context.
Having knowledge implies that it can be exercised to solve a problem, whereas having information does not carry the same connotation. An ability to act is an integral part of being knowledgeable. For example, two people in the same context with the same information may not have the same ability to use the information with the same degree of success. Hence, there is a difference in the human capability to add value. Such differences in ability may be due to different experiences, different training, different perspectives, and other factors. Whereas data, information, and knowledge may all be viewed as assets of an organization, knowledge provides a higher level of meaning about data and information. Because knowledge conveys meaning, it tends to be much more valuable—yet more ephemeral.
Although the term data mining is relatively new to many people, the ideas behind it are not. Many of the techniques used in data mining have roots in traditional statistical analysis and in artificial intelligence work done since the early 1950s. Why, then, has data mining suddenly gained the attention of the business world? Following are some of the most common reasons:
More intense competition at the global scale. In many markets, there are more suppliers than demand can support.
Constantly changing needs and wants of the customers. Because of the increasing number of suppliers and their offerings (e.g., higher quality, lower cost, faster service), customer demand is changing in a dramatic fashion.
Recognition of the value of data. Businesses are now aware of the untapped value hidden in large data sources.
Changing culture of management. Data-driven, evidence-based decision making is becoming a common practice, significantly changing the way managers work.
Improved data capture and storage techniques. Collection and integration of data from a variety of sources into standardized data structures enable businesses to have quality data about customers, vendors, and business transactions.
Emergence of data warehousing. Databases and other data repositories are being consolidated into a single location in the form of a data warehouse to support analytics and managerial decision making.
Technological advancements in hardware and software. Processing and storage capabilities of computing devices are increasing exponentially.
Cost of ownership. While capabilities are increasing, the costs of hardware and software for data storage and processing are rapidly decreasing.
Availability of data. Living in the age of the Internet brings new opportunities to analytically savvy businesses to identify and tap into very large and information-rich data sources (e.g., social networks, social media) to gain deeper insight.
Data is everywhere. For instance, data generated by Internet-based activities is increasing rapidly, reaching volumes for which we did not even have specific names in the very recent past. Large amounts of genomic data and related information (in the form of publications and research findings published in journal articles and other media outlets) are being generated and accumulated all over the world. Disciplines such as astronomy and nuclear physics create huge quantities of data on a regular basis. Medical and pharmaceutical researchers constantly generate and store data that can then be used in data mining applications to identify better ways to diagnose and treat illnesses and to discover new and improved drugs. On the commercial side, perhaps the most common use of data and data mining has been in the finance, retail, and health care sectors. Data mining is used to detect and reduce fraudulent activities, especially in insurance claims and credit card use; to identify customer buying patterns; to reclaim profitable customers; to identify trading rules from historical data; and to increase profitability using market-basket analysis. Data mining is already widely used to better target clients, and with the widespread development of e-commerce, this use is likely to become even more important over time.
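To make the idea of market-basket analysis mentioned above more concrete, the following Python sketch may help. It is only an illustration under stated assumptions: the toy baskets, the function name pair_rules, and the min_support threshold are hypothetical and do not come from any particular data mining tool. For every pair of items that appears together often enough, it computes three commonly used measures: support (how often the pair occurs), confidence (how often the second item appears when the first does), and lift (how much more often than chance the two appear together).

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each set is the content of one shopping basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def pair_rules(baskets, min_support=0.4):
    """Return (antecedent, consequent, support, confidence, lift) tuples
    for every item pair whose support is at least min_support."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each unordered pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is too rare to be of interest
        # Each frequent pair yields two directed rules: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]
            lift = confidence / (item_counts[y] / n)
            rules.append((x, y, support, confidence, lift))
    # Rules with the highest lift come first.
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


rules = pair_rules(transactions)
for x, y, support, confidence, lift in rules:
    print(f"{x} -> {y}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

In this toy data set, the rule beer -> diapers has a support of 0.6, a confidence of 1.0, and a lift of 1.25, meaning that baskets containing beer include diapers more often than baskets in general do. Real applications work with far more transactions and with item sets larger than pairs, which typically calls for a dedicated algorithm rather than this brute-force count.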
What Is Data Mining?
At the most basic level, data mining can be defined as the process of discovering (i.e., mining) knowledge (i.e., actionable information) from large amounts of data. On reflection, the term data mining is something of a misnomer: mining gold from rocks and dirt is called gold mining, not rock or dirt mining. By the same logic, data mining perhaps should have been named knowledge mining or knowledge discovery. Despite the mismatch between the term and its meaning, data mining has become the term of choice in the community at large. Several other names (including knowledge discovery in databases, information extraction, pattern analysis, information harvesting, and pattern searching, among others) have been suggested to replace data mining, but none has gained any significant traction.
Data mining is a process that involves using statistical, mathematical, and artificial intelligence techniques and algorithms to extract and identify useful information and subsequent knowledge (or patterns) from large sets of data. These patterns can be in the form of business rules, affinities, correlations, trends, or predictions. Fayyad et al. (1996) defined data mining as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases,” where the data is organized in records structured by categorical, ordinal, and continuous variables. The key terms in this definition have the following meanings:
Process implies that data mining comprises many iterative steps.
Nontrivial means that some experimentation-type search or inference is involved; that is, it is not as straightforward as a computation of predefined quantities.
Valid means that the discovered patterns should hold true on new data with a sufficient degree of certainty.
Novel means that the patterns were not previously known to the user in the context of the system being analyzed.
Potentially useful means that the discovered patterns should lead to some benefit to the user or task.
Ultimately understandable means that the pattern should make business sense, if not immediately then at least after some processing, so that users say “This makes sense. Why didn’t I think of that?”
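Two of the terms defined above, nontrivial and valid, can be illustrated with a short Python sketch. Everything in it is a hypothetical assumption made for illustration: the records (monthly spend and whether the customer churned), the function names, and the MIN_ACCURACY cutoff that stands in for "a sufficient degree of certainty." The pattern is not computed from a predefined formula; it is found by searching over candidate thresholds on one set of records and is then checked against records that were not used in the search.

```python
# Hypothetical records: (monthly_spend, churned).
train = [
    (12, True), (18, True), (25, True), (31, False),
    (40, False), (47, False), (22, True), (35, False),
]
# New records that were not used to discover the pattern.
holdout = [
    (15, True), (28, False), (33, False),
    (20, True), (45, False), (27, True),
]

# Assumed cutoff for "a sufficient degree of certainty".
MIN_ACCURACY = 0.8


def rule_accuracy(records, threshold):
    """Share of records for which the rule
    'spend below threshold means churn' gives the right answer."""
    hits = sum((spend < threshold) == churned for spend, churned in records)
    return hits / len(records)


def discover_threshold(records):
    """Nontrivial step: try every observed spend value as a threshold
    and keep the one that fits the given records best."""
    candidates = sorted({spend for spend, _ in records})
    return max(candidates, key=lambda t: rule_accuracy(records, t))


threshold = discover_threshold(train)
train_acc = rule_accuracy(train, threshold)
holdout_acc = rule_accuracy(holdout, threshold)
# The pattern counts as valid only if it also holds on the new records.
is_valid = holdout_acc >= MIN_ACCURACY

print(f"Rule: spend < {threshold} means churn")
print(f"Accuracy on training records: {train_acc:.2f}")
print(f"Accuracy on new records:      {holdout_acc:.2f}")
print(f"Valid at the {MIN_ACCURACY:.0%} level: {is_valid}")
```

With these toy records, the search settles on a threshold of 31, which fits the training records perfectly but is right for only five of the six new records. The gap is the reason validity is judged on new data rather than on the data used to find the pattern.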
Data mining is not a new discipline but rather a new approach at the intersection of several other scientific disciplines. To some extent, data mining is a new philosophy that suggests the use of data and mathematical models to create or discover new knowledge. Data mining leverages capabilities of other disciplines, including statistics, artificial intelligence, machine learning, management science, information systems, and databases, in a systematic and synergistic way (see Figure 2.2). Using the collective capabilities of these scientific disciplines, data mining aims to make progress in extracting useful information and knowledge from large data repositories. It is an emerging field that has attracted much attention in a very short time and has fueled the emergence and popularity of the analytics movement.