Popular Data Mining Tools
A large number of software vendors provide powerful data mining tools. Some of these vendors provide only data mining and statistical analysis software, while others are large firms that offer a wide range of software, hardware, and consulting services in addition to data mining products. Examples of vendors that provide data mining tools include IBM (IBM SPSS Modeler, formerly known as SPSS PASW Modeler and Clementine), SAS (Enterprise Miner), StatSoft (Statistica Data Miner, now a TIBCO company), KXEN (Infinite Insight, now an SAP company), Salford (CART, MARS, TreeNet, and RandomForest), Angoss (KnowledgeSTUDIO and KnowledgeSeeker), and Megaputer (PolyAnalyst). Notably, but not surprisingly, the most popular data mining tools were originally developed by the well-established statistical software companies (SPSS, SAS, and StatSoft). This is largely because statistics is the foundation of data mining, and these companies have the means to cost-effectively develop their statistical packages into full-scale data mining systems.
Most of the business intelligence (BI) tool vendors (e.g., IBM Cognos, Oracle Hyperion, SAP Business Objects, MicroStrategy, Teradata, Microsoft) also have some level of data mining capability integrated into their software offerings. These BI tools are still primarily focused on descriptive analytics, in the sense of multidimensional modeling and data visualization, and are not considered direct competitors of the data mining tool vendors.
In addition to the commercial data mining tools, several open source and/or free data mining software tools are available over the Internet. Historically, the most popular free (and open source) data mining tool has been Weka, which was developed by several researchers from the University of Waikato in New Zealand (and can be downloaded from cs.waikato.ac.nz/ml/weka/). Weka includes a large number of algorithms for different data mining tasks and has an intuitive user interface. Another free (for noncommercial use) data mining tool that quickly became popular is RapidMiner, developed by RapidMiner.com (which can be downloaded from rapidminer.com). Its graphically enhanced user interface, large number of algorithms, and variety of data visualization features set it apart from the other free data mining tools.
Another free and open source data mining tool with an appealing workflow-type graphical user interface is KNIME Analytics Platform (which can be downloaded from knime.org). A detailed description of KNIME can be found in Appendix A.
The main difference between commercial tools, such as Enterprise Miner, IBM SPSS Modeler, and Statistica, and free tools, such as Weka, RapidMiner, and KNIME, is often computational efficiency. The same data mining task involving a large data set may take much longer to complete with free software, and for some algorithms, it may not complete at all (i.e., it may crash due to inefficient use of computer memory). With cloud-based analytics, this deficiency of open source tools is no longer as prominent as it used to be. For instance, an analytics model can be developed with a small data sample in KNIME Analytics Platform and then deployed and executed on a cloud platform with the complete, large data set. In addition to software tools, code-based analytics tools and high-level programming languages (e.g., Python, R, JavaScript) are also gaining tremendous popularity in the world of analytics and data science. Table 2.1 lists the major data mining software products and their websites.
Table 2.1 Popular Data Mining Software Tools
Product | Website (URL) |
---|---|
KNIME Analytics Platform | |
SAS Enterprise Miner | |
IBM SPSS Modeler | |
TIBCO Statistica | |
RapidMiner | |
PolyAnalyst | |
Salford Predictive Modeler | |
XLMiner | |
DataRobot Enterprise Analytics and AI | |
Databricks Unified Analytics Platform | |
Apache Spark Analytics | |
H2O Analytics | |
Teradata Warehouse Miner | |
Oracle Data Mining | oracle.com/database/technologies/advanced-analytics/odm.html |
R for Analytics | |
Python for Analytics | |
Open Source Analytical API Platform for JavaScript | |

Microsoft’s SQL Server includes a suite of business intelligence capabilities that has become increasingly popular for data mining studies. With SQL Server, data and analytic models are hosted in the same relational database environment, significantly increasing the efficiency of model execution while making model management a considerably easier task. The Microsoft Enterprise Consortium serves as the worldwide source for access to Microsoft’s SQL Server software suite for academic purposes (i.e., teaching and research). The consortium was established to enable universities around the world to access enterprise technology without having to maintain the necessary hardware and software on their own campuses. The consortium provides a wide range of business intelligence development tools (e.g., data mining, cube building, business reporting) as well as a number of large, realistic data sets from companies such as Sam’s Club, Dillard’s, and Tyson Foods. The Microsoft Enterprise Consortium is free of charge and can be used only for academic purposes. The Sam M. Walton College of Business at the University of Arkansas hosts the enterprise system and allows consortium members and their students to access these resources using a simple remote desktop connection. For details about becoming a part of the consortium and easy-to-follow tutorials and examples, see https://walton.uark.edu/enterprise/Microsoft/index.php.
In May 2019, KDnuggets (a well-known web portal for data mining and analytics links and resources) conducted its 20th annual software poll on the following question: “What analytics, data science, machine learning software/tools have you used in the last three years (2017–2019) for a real project?” This poll received huge attention from the analytics and data science community, attracting more than 1,800 unique voters. The poll measures both how widely a data analytics/data science software tool is used and how strongly the vendors advocate for their tool. Here are some of the interesting findings that came out of the poll:
Many business analytics and data science software users use more than one tool to carry out data analytics projects. According to the poll, the average number of tools used by a person or vendor was 6.1 in 2019 (compared to 3.7 in 2014). This is a clear indication that most data scientists use a combination of tools (commercial, free/open source software, programming languages, and open access algorithms and model libraries as community projects). Using only one tool or language seems to be insufficient to deal with the requirements of the new generation of analytics projects.
The popularity of free and open-source software tools and programming languages far exceeded that of the commercial tools. More than two-thirds of the most popular tools in the top 40 are either free/open source software with graphical user interfaces or programming languages and libraries of models/algorithms for data analytics. Overall, the most popular tool was Python (with 65% of the votes), as was the case in 2018.
While the percentage of votes for big data tools (e.g., Apache Spark, Hadoop, Kafka) and technologies decreased, the deep learning tools, technologies, and libraries (e.g., TensorFlow, Keras, PyTorch) gained significant popularity.
Figure 2.4 shows the results of the poll for tools that placed in the top 40, based on the number of unique votes they received. The chart in this figure shows the number of votes for each of these tools.
Figure 2.4 Most Popular Data Mining and Analytics Software Tools (User Poll Results)
Source: Used with permission of kdnuggets.com.
To reduce bias from multiple voting, KDnuggets used email verification in this poll, which may have reduced the total number of votes but made the results less biased and more representative.
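The poll statistics discussed above (unique voters, average number of tools per voter, and each tool's share of voters) can be illustrated with a minimal Python sketch. It is not KDnuggets' actual procedure, which the text does not describe. The ballot format, the `summarize_poll` function, and the sample data are all hypothetical, and the sketch assumes that one ballot is kept per verified email address.

```python
from collections import Counter


def summarize_poll(ballots):
    """Summarize poll ballots given as (email, verified, tools) tuples.

    The tuple layout is a hypothetical format chosen for illustration.
    """
    # Keep only the first ballot from each verified email address
    # (assumed deduplication rule; addresses compared case-insensitively).
    unique = {}
    for email, verified, tools in ballots:
        key = email.strip().lower()
        if verified and key not in unique:
            unique[key] = set(tools)

    n_voters = len(unique)
    if n_voters == 0:
        return {"voters": 0, "avg_tools": 0.0, "share": {}}

    # Count how many unique voters selected each tool.
    votes = Counter(tool for tools in unique.values() for tool in tools)
    avg_tools = sum(len(tools) for tools in unique.values()) / n_voters
    share = {tool: count / n_voters for tool, count in votes.items()}
    return {"voters": n_voters, "avg_tools": avg_tools, "share": share}


# Made-up ballots, not real poll data.
ballots = [
    ("a@example.com", True, ["Python", "R", "KNIME"]),
    ("A@example.com", True, ["Python"]),        # repeat vote, ignored
    ("b@example.com", False, ["Weka"]),         # unverified, ignored
    ("c@example.com", True, ["Python", "RapidMiner"]),
    ("d@example.com", True, ["R"]),
]
result = summarize_poll(ballots)
print(result["voters"], result["avg_tools"])
```

Because each voter can select several tools, the shares across tools sum to more than 100%, which is why a figure such as Python's 65% is a share of voters rather than a share of all votes cast.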