The Best Data Cleaning Utilities for Windows Systems
Yep, Your Data Is Dirty
In a perfect world, all the data we use would be correct as we receive it. Addresses would be complete, products would be uniquely identified, and everyone’s numbers would match up.
To get an idea of how imperfect the world is, consider a simple statistic. Surveys by firms such as Knowledge Discovery show that more than 80% of the workers on data warehouse projects spend at least 40% of their time on data cleaning. If you’re dealing with a collection of data, sooner or later you’re going to have to clean it to make sure that it’s correct and consistent.
The problems with data are many, varied, and often just plain maddening. Many of them arise because in their native state computers don’t have the concept of "similar." If two items don’t match exactly, to the computer they’re different. Another concept computers lack in their native state is "reasonable." Unless specific checks are made, they’ll accept any value for data. Humans are much less picky in both areas, and humans are almost always the originating source for the data.
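To make the two missing concepts concrete, here is a minimal Python sketch. It is an illustration, not a description of how any particular cleaning utility works: the similarity score comes from the standard library's difflib.SequenceMatcher, and the "age" field with its 0 to 120 range is a hypothetical example of a reasonableness check.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 and 1.0; higher means more alike."""
    # Compare case-insensitively so "MCDONALDS" and "mcdonalds" score 1.0.
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def is_reasonable_age(value: int) -> bool:
    """Hypothetical reasonableness check: the 0-120 range is an assumption."""
    return 0 <= value <= 120


# To the computer, these two strings are simply different...
exact_match = "McDonalds" == "MacDonalds"

# ...but a similarity score shows they are close enough to deserve a second look.
score = similarity("McDonalds", "MacDonalds")

# Without an explicit check, a value such as 340 would be stored without complaint.
suspect = not is_reasonable_age(340)
```

Where to draw the line between "similar enough" and "different" is a judgment call that depends on the data, which is one reason borderline cases usually end up in front of a person.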
The result is some amazing variations. One company that maintained a database of 14 million businesses in the United States turned up no fewer than 17 different ways to spell McDonald’s (the name of the hamburger chain). Here are some of the common variations:
- McDonalds
- Mc Donalds
- MacDonalds
- Mcdonald’s
- Macdonalds
Of course, each of those variations appeared as a different business in the uncleaned data.
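One common way to collapse such variations is to reduce every name to a simplified comparison key and group records that share a key. The following Python sketch shows the idea using the spellings listed above. The function name name_key and its rules (ignore case, drop spaces and punctuation, treat a leading "Mac" as "Mc") are assumptions chosen for this example, not the method used by the company mentioned.

```python
import re


def name_key(name: str) -> str:
    """Reduce a business name to a crude comparison key (illustrative rules only)."""
    text = name.casefold()
    # Drop spaces, apostrophes, and any other non-alphanumeric characters.
    text = re.sub(r"[^a-z0-9]", "", text)
    # Assumption: treat a leading "mac" as "mc". This is deliberately aggressive
    # and would also rewrite unrelated names that happen to start with "Mac".
    if text.startswith("mac"):
        text = "mc" + text[3:]
    return text


variants = [
    "McDonald’s",
    "McDonalds",
    "Mc Donalds",
    "MacDonalds",
    "Mcdonald’s",
    "Macdonalds",
]

# All six spellings collapse to a single key.
keys = {name_key(v) for v in variants}
```

Rules this blunt will merge names that should stay separate, so the groups they produce are best treated as candidates for review rather than confirmed duplicates.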
Or consider an inventory with entries for hamburger, ground round, grnd rnd, and such. Until you have to deal with data from multiple sources, you wouldn’t believe the number of different ways in which people can list the same item.
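When the variations are abbreviations or synonyms rather than misspellings, string similarity helps less, and a hand-maintained lookup table is a common fallback. A minimal sketch in Python follows. The table name ALIASES, the function canonical_item, and the choice of "ground round" as the canonical label are all assumptions made for this example.

```python
from typing import Optional

# Hypothetical alias table: every known spelling maps to one canonical label.
ALIASES = {
    "hamburger": "ground round",
    "ground round": "ground round",
    "grnd rnd": "ground round",
}


def canonical_item(raw: str) -> Optional[str]:
    """Return the canonical label, or None if the entry is not in the table."""
    # Ignore case and collapse runs of whitespace before looking up the entry.
    key = " ".join(raw.casefold().split())
    return ALIASES.get(key)


known = canonical_item("  Grnd   Rnd ")
unknown = canonical_item("sirloin")  # None: leave for a person to resolve
```

Returning None for unknown entries, instead of guessing, keeps unrecognized items visible so that someone who knows the inventory can decide where they belong.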
A related problem is that we often have to combine data from different sources. This data was generally collected by different people or different organizations, for different purposes, and not all the fields were equally important to the people collecting it. Even assuming that the formatting of the data is consistent (or, more likely, has been corrected in a separate step), you can assume that the more sources your data comes from, the more cleaning it will need.
Data cleaning jobs vary from a list of a few hundred entries maintained by a Windows administrator as a spare-time task to elaborate data warehouses with millions or tens of millions of entries, so the tools have to vary as well. Some data cleaning tools are simple, inexpensive utilities that anyone can use, and some are costly packages that take months to implement and require highly trained operators. In addition to such specialist utilities, many programs, such as some databases and statistics packages, include data cleaning utilities.
Although data cleaning, particularly on a small scale, is often done as a standalone operation, it’s commonly treated as part of the Extraction, Transformation, and Loading (ETL) of data. Data cleaning, along with format conversion and similar jobs, belongs to the transformation step: it happens after the data is pulled from whatever program(s) generated it and before the data is put into the software that will actually use it.
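The position of cleaning within ETL can be sketched in a few lines of Python. Everything here is a stand-in chosen for illustration: the source is a small in-memory CSV string, the "warehouse" is a plain list, and the function names extract, transform, and load are simply labels for the three stages.

```python
import csv
import io

# Hypothetical raw export: stray spaces and inconsistent capitalization.
RAW = "name,city\n McDonalds ,chicago\nMcDonalds,Chicago\n"


def extract(text: str) -> list:
    """Extraction: pull rows out of the source format."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows: list) -> list:
    """Transformation: tidy whitespace and drop case-insensitive duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        record = {field: " ".join(value.split()) for field, value in row.items()}
        key = tuple(value.casefold() for value in record.values())
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


def load(rows: list, target: list) -> None:
    """Loading: hand the cleaned rows to the system that will use them."""
    target.extend(rows)


warehouse = []
load(transform(extract(RAW)), warehouse)
```

In this sketch the two raw rows differ only in spacing and capitalization, so the transformation step keeps the first and discards the second before anything reaches the target.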
Several categories of data cleaning tools are available for Windows systems, depending on the nature of the data, the amount of data, and what kind of operation is needed. A number of utilities are designed specifically to check address lists. The major mailing-list companies also provide services to clean address lists and check them against their records. There are even specialized tools available for particular kinds of data, such as the FBI’s Uniform Crime Report database or topographic data. A good portion of success in data cleaning comes from choosing the appropriate tool.
No utility can do everything. Many of the questionable entries found by the software will have to be examined and resolved by a knowledgeable human. However, the right software can help you to improve the accuracy of your data enormously.