What Data Mining Is Not
Because of its appeal, many have used the term data mining to refer to any analysis that has something to do with data. For instance, people may refer to casual Internet searching as data mining. It’s true that Internet searching is a way to mine large and diverse information sources to find (i.e., discover) evidence for a specific question or query, and because of that it may seem like data mining. However, data mining is the process of discovering reusable patterns through the application of statistical and/or machine learning techniques. A search returns documents that answer one query, whereas data mining produces patterns that can be applied again to new data. Data mining is therefore much more rigorous and scientific than simply querying the Internet.
Another concept that data mining is often confused with is OLAP (online analytical processing). As the core enabler of the business intelligence movement, OLAP is a collection of database-querying methods for searching very large databases (or data warehouses) through the use of data cubes. Cubes are multidimensional representations of the data stored in data warehouses. With the use of cubes, OLAP helps decision makers to slice and dice organizational data for the purpose of finding answers to “What happened?” “Where did it happen?” and “When did it happen?” types of questions. As sophisticated as it may sound—and perhaps from the standpoint of efficiency, it actually is that sophisticated—OLAP is not data mining. It may be a precursor to data mining. In a sense, they complement each other in that they convert data into information and knowledge for better and faster decision making. Whereas OLAP is part of descriptive analytics, data mining is an essential part of predictive analytics.
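The contrast between descriptive and predictive use of the same data can be made concrete with a minimal sketch. The sales records, the function names (`slice_and_dice`, `fit_trend`, `predict`), and the choice of a simple least-squares trend line are all assumptions made for illustration; a real OLAP engine works on data cubes in a data warehouse, and real data mining relies on far richer techniques than a straight-line fit.

```python
from collections import defaultdict

# Hypothetical sales records: (region, quarter, units_sold). Invented for illustration.
sales = [
    ("East", 1, 100), ("East", 2, 120), ("East", 3, 140), ("East", 4, 160),
    ("West", 1, 80), ("West", 2, 85), ("West", 3, 90), ("West", 4, 95),
]


def slice_and_dice(records):
    """OLAP-style roll-up: units per (region, quarter) cell plus per-region totals."""
    cube = defaultdict(int)
    for region, quarter, units in records:
        cube[(region, quarter)] += units   # one cell of the cube
        cube[(region, "ALL")] += units     # roll-up across all quarters
    return dict(cube)


def fit_trend(records, region):
    """Least-squares line units = intercept + slope * quarter for one region."""
    points = [(q, u) for r, q, u in records if r == region]
    n = len(points)
    mean_q = sum(q for q, _ in points) / n
    mean_u = sum(u for _, u in points) / n
    sxy = sum((q - mean_q) * (u - mean_u) for q, u in points)
    sxx = sum((q - mean_q) ** 2 for q, _ in points)
    slope = sxy / sxx
    intercept = mean_u - slope * mean_q
    return intercept, slope


def predict(model, quarter):
    """Apply a fitted trend line to a quarter that is not in the data."""
    intercept, slope = model
    return intercept + slope * quarter


cube = slice_and_dice(sales)
east_model = fit_trend(sales, "East")

# Descriptive question: what happened in the East region?
print("East total units:", cube[("East", "ALL")])
# Predictive question: what is likely to happen in quarter 5?
print("East quarter 5 estimate:", predict(east_model, 5))
```

The roll-up only reports what is already in the records, while the fitted line is a reusable pattern that can be applied to a quarter that has not happened yet. That difference is the one the paragraph draws between OLAP and data mining.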
There has been much discussion about statistics and data mining; some people suggest that data mining is a part of statistics, and others propose that statistics is a part of data mining. Yet others suggest that data mining and statistics are the same thing. Even though we cannot get to the bottom of that discussion here, we can at least shed some light on it by mentioning a few critical points. Data mining and statistics have a lot in common. They both look for relationships in data. The main difference between the two is that statistics starts with a well-defined proposition and hypothesis, whereas data mining starts with a loosely defined discovery statement. Statistics collects a sample of data (i.e., primary data) to test the hypothesis, whereas data mining and analytics use existing data (i.e., often observational, secondary data) to discover novel patterns and relationships. Another difference is in the size of the data that they use. Data mining looks for data sets that are as “big” as possible, whereas statistics looks for the right size of data and, if the data set is larger than what is needed for the statistical analysis, uses a sample of it. Statistics and data mining also have different definitions of what constitutes “large data”: whereas a few hundred to a thousand data points is large enough for a statistician, several million to a few billion data points is considered large for data mining studies.
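The point about data size can be illustrated with a small sketch. Everything here is an assumption for illustration: the data set is synthetic, and the sizes (100,000 records, a sample of 1,000) are chosen only to echo the orders of magnitude mentioned above. Computing a mean over all records is only a stand-in for a pass over the full data; it is not itself a data mining technique.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "existing" data set: 100,000 synthetic measurements.
population = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Statistical habit: draw a sample of the "right size" and estimate from it,
# reporting the uncertainty of the estimate.
sample = random.sample(population, 1_000)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Data mining habit: use every available record.
full_mean = statistics.mean(population)

print(f"sample mean    {sample_mean:.2f} +/- {standard_error:.2f}")
print(f"full-data mean {full_mean:.2f}")
```

For a single well-defined quantity such as a mean, a thousand points already give a tight estimate, which is why a statistician may regard that as large enough. Discovering novel patterns across many variables is a different task, and it tends to benefit from far more records.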
In summary, data mining is not a simple Internet search or routine application of OLAP, and it’s not the same as statistics. Even though it uses capabilities of these descriptive techniques, data mining is the next level in the analytics hierarchy, where interesting patterns (relationships and future trends) are discovered with the use of data and models.