Introduction to Predictive Analytics and Data Mining

By Dursun Delen
Dec 21, 2020




Predictive Analytics: Data Mining, Machine Learning and Data Science for Practitioners, 2nd Edition

Learn More Buy

Popular Data Mining Tools

A large number of software vendors provide powerful data mining tools. Some of these vendors provide only data mining and statistical analysis software, while others are large firms that offer a wide range of software, hardware, and consulting services in addition to data mining products. Examples of vendors that provide data mining tools include IBM (IBM SPSS Modeler, formerly known as SPSS PASW Modeler and Clementine), SAS (Enterprise Miner), StatSoft (Statistica Data Miner; now a TIBCO company), KXEN (Infinite Insight; now an SAP company), Salford (CART, MARS, TreeNet, and RandomForest), Angoss (KnowledgeSTUDIO and KnowledgeSeeker), and Megaputer (PolyAnalyst). Notably but not surprisingly, the most popular data mining tools were originally developed by well-established statistical software companies (SPSS, SAS, and StatSoft). This is largely because statistics is the foundation of data mining, and these companies have the means to cost-effectively develop their statistical products into full-scale data mining systems.

Most business intelligence (BI) tool vendors (e.g., IBM Cognos, Oracle Hyperion, SAP Business Objects, MicroStrategy, Teradata, Microsoft) also have some level of data mining capability integrated into their software offerings. These BI tools are still primarily focused on descriptive analytics, in the sense of multidimensional modeling and data visualization, and are not considered direct competitors of the data mining tool vendors.

In addition to the commercial data mining tools, several open source and/or free data mining software tools are available over the Internet. Historically, the most popular free (and open source) data mining tool is Weka, which was developed by several researchers from the University of Waikato in New Zealand (and can be downloaded from cs.waikato.ac.nz/ml/weka/). Weka includes a large number of algorithms for different data mining tasks and has an intuitive user interface. Another free (for noncommercial use) data mining tool that quickly became popular is RapidMiner, developed by RapidMiner.com (which can be downloaded from rapidminer.com). Its graphically enhanced user interface, large collection of algorithms, and variety of data visualization features set it apart from other free data mining tools.

Another free and open source data mining tool with an appealing workflow-type graphical user interface is KNIME Analytics Platform (which can be downloaded from knime.org). A detailed description of KNIME can be found in Appendix A.

The main difference between commercial tools, such as Enterprise Miner, IBM SPSS Modeler, and Statistica, and free tools, such as Weka, RapidMiner, and KNIME, is often computational efficiency. The same data mining task involving a large data set may take much longer to complete with free software, and for some algorithms, it may not complete at all (i.e., it may crash due to inefficient use of computer memory). With cloud-based analytics, this deficiency of open source tools is no longer as prominent as it used to be. For instance, an analytics model can be developed with a small data sample in KNIME Analytics Platform and then deployed and executed on a cloud platform with the complete, large data set. In addition to software tools, code-based analytics tools and high-level programming languages (e.g., Python, R, JavaScript) are also gaining tremendous popularity in the world of analytics and data science. Table 2.1 lists the major data mining software products and their websites.

Table 2.1 Popular Data Mining Software Tools

Product	Website (URL)
KNIME Analytics Platform	knime.org
SAS Enterprise Miner	https://www.sas.com/en_us/software/enterprise-miner.html
IBM SPSS Modeler	ibm.com/products/spss-modeler
TIBCO Statistica	docs.tibco.com/products/tibco-statistica
RapidMiner	rapidminer.com
PolyAnalyst	megaputer.com/polyanalyst.php
Salford Predictive Modeler	salford-systems.com
XLMiner	https://www.solver.com/xlminer-platform
DataRobot Enterprise Analytics and AI	https://www.datarobot.com/
Databricks Unified Analytics Platform	https://databricks.com/
Apache Spark Analytics	https://spark.apache.org/
H2O Analytics	https://h2oanalytics.com/
Teradata Warehouse Miner	https://www.teradata.com/
Oracle Data Mining	oracle.com/database/technologies/advanced-analytics/odm.html
R for Analytics	https://www.r-project.org/
Python for Analytics	https://www.python.org/
Open Source Analytical API Platform for JavaScript	https://cube.dev/
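
The develop-on-a-sample, run-on-the-full-data workflow mentioned before Table 2.1 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the synthetic one-feature data, the toy midpoint classifier, and the function names fit_threshold and score_in_chunks are hypothetical and are not part of KNIME or any other tool in the table. The point is only that a model fitted on a small sample can then be applied to a much larger data set in bounded-memory chunks.

```python
import random
from statistics import mean


def fit_threshold(sample):
    """Fit a toy one-feature classifier: the midpoint between the class means."""
    pos = [x for x, label in sample if label == 1]
    neg = [x for x, label in sample if label == 0]
    return (mean(pos) + mean(neg)) / 2


def score_in_chunks(values, threshold, chunk_size=1000):
    """Score an iterable chunk by chunk so memory use stays bounded."""
    chunk = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield [int(v > threshold) for v in chunk]
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield [int(v > threshold) for v in chunk]


# Hypothetical "full" data set: two classes with different feature means.
rng = random.Random(42)
full_data = [(rng.gauss(0.0, 1.0), 0) for _ in range(5000)]
full_data += [(rng.gauss(4.0, 1.0), 1) for _ in range(5000)]

# Step 1: develop the model on a small random sample.
sample = rng.sample(full_data, 200)
threshold = fit_threshold(sample)

# Step 2: apply the fitted model to the complete data set in chunks.
features = (x for x, _ in full_data)
predictions = [p for chunk in score_in_chunks(features, threshold) for p in chunk]
accuracy = sum(p == label for p, (_, label) in zip(predictions, full_data)) / len(full_data)
print(f"threshold={threshold:.2f}, accuracy on full data={accuracy:.3f}")
```

In a real project the chunks would typically be read from disk or a database rather than held in a list, which is where the memory savings actually come from.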

Microsoft’s SQL Server includes a suite of business intelligence capabilities that has become increasingly popular for data mining studies. With SQL Server, data and analytic models are hosted in the same relational database environment, significantly increasing the efficiency of model execution while making model management a considerably easier task. The Microsoft Enterprise Consortium serves as the worldwide source for access to Microsoft’s SQL Server software suite for academic purposes (i.e., teaching and research). The consortium was established to enable universities around the world to access enterprise technology without having to maintain the necessary hardware and software on their own campuses. The consortium provides a wide range of business intelligence development tools (e.g., data mining, cube building, business reporting) as well as a number of large, realistic data sets from companies such as Sam’s Club, Dillard’s, and Tyson Foods. The Microsoft Enterprise Consortium is free of charge and can be used only for academic purposes. The Sam M. Walton College of Business at the University of Arkansas hosts the enterprise system and allows consortium members and their students to access these resources using a simple remote desktop connection. For details about becoming a part of the consortium and easy-to-follow tutorials and examples, see https://walton.uark.edu/enterprise/Microsoft/index.php.

In May 2019, KDnuggets (a well-known web portal for data mining and analytics links and resources) conducted its 20th annual software poll on the following question: “What analytics, data science, machine learning software/tools have you used in the last three years (2017–2019) for a real project?” This poll received huge attention from the analytics and data science community, attracting more than 1,800 unique voters. The poll measures both how widely a data analytics/data science software tool is used and how strongly the vendors advocate for their tool. Here are some of the interesting findings that came out of the poll:

Many business analytics and data science software users use more than one tool to carry out data analytics projects. According to the poll, the average number of tools used by a person or vendor was 6.1 in 2019 (compared to 3.7 in 2014). This is a clear indication that most data scientists use a combination of tools (commercial, free/open source software, programming languages, and open access algorithms and model libraries as community projects). Using only one tool or language seems to be insufficient to deal with the requirements of the new generation of analytics projects.
The popularity of free and open-source software tools and programming languages far exceeded that of the commercial tools. More than two-thirds of the most popular tools in the top 40 are either free/open source software with graphical user interfaces or programming languages and libraries of models/algorithms for data analytics. Overall, the most popular tool was Python (with 65% of the votes), as was the case in 2018.
While the percentage of votes for big data tools (e.g., Apache Spark, Hadoop, Kafka) and technologies decreased, the deep learning tools, technologies, and libraries (e.g., TensorFlow, Keras, PyTorch) gained significant popularity.
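
To make the poll statistics above concrete, the following minimal Python sketch shows how a per-tool vote share and an average number of tools per voter can be computed from multi-select ballots. The ballots are invented for illustration and are not KDnuggets data, and the function name summarize is hypothetical. Because each voter can pick several tools, the shares add up to more than 100%; in fact, their sum equals the average number of tools per voter.

```python
from collections import Counter

# Hypothetical ballots: each voter may select several tools.
ballots = [
    {"Python", "R", "SQL"},
    {"Python", "RapidMiner"},
    {"R", "SQL", "Excel", "Python"},
    {"KNIME"},
]


def summarize(ballots):
    """Return (share of voters per tool, average number of tools per voter)."""
    votes = Counter(tool for ballot in ballots for tool in ballot)
    n_voters = len(ballots)
    share = {tool: count / n_voters for tool, count in votes.items()}
    avg_tools = sum(len(ballot) for ballot in ballots) / n_voters
    return share, avg_tools


share, avg_tools = summarize(ballots)
for tool, fraction in sorted(share.items(), key=lambda item: -item[1]):
    print(f"{tool:<12}{fraction:.0%}")
print(f"average tools per voter: {avg_tools}")
```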

Figure 2.4 shows the results of the poll for tools that placed in the top 40, based on the number of unique votes they received. The chart in this figure shows the number of votes for each of these tools.

Figure 2.4 Most Popular Data Mining and Analytics Software Tools (User Poll Results)

Source: Used with permission of kdnuggets.com.

To reduce bias from multiple voting, KDnuggets used email verification in this poll, which may have reduced the total number of votes but made the results less biased and more representative.
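
As a rough illustration of the idea, the sketch below keeps at most one ballot per verified email address. It is a hypothetical example, not a description of how KDnuggets implemented its verification: the record layout (email, verified, tools) and the function name keep_verified_unique are assumptions.

```python
def keep_verified_unique(submissions):
    """Keep the first submission per verified email address (case-insensitive)."""
    seen = set()
    kept = []
    for submission in submissions:
        email = submission["email"].strip().lower()
        if not submission["verified"] or email in seen:
            continue  # skip unverified addresses and repeat voters
        seen.add(email)
        kept.append(submission)
    return kept


# Hypothetical submissions, including a repeat and an unverified address.
submissions = [
    {"email": "a@example.com", "verified": True, "tools": ["Python"]},
    {"email": "A@example.com ", "verified": True, "tools": ["Python"]},
    {"email": "b@example.com", "verified": False, "tools": ["R"]},
    {"email": "c@example.com", "verified": True, "tools": ["R", "SQL"]},
]

kept = keep_verified_unique(submissions)
print(len(submissions), "submitted,", len(kept), "counted")
```

The trade-off described in the text is visible here: fewer ballots are counted, but each counted ballot corresponds to one confirmed address.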
