- Yep, Your Data Is Dirty
- Quick-and-Dirty Data Cleaning
- Third-Party Utilities
- Fuzzy Stuff: SQL Server 2005 and Integration Services
- Other Data-Cleaning Products
- Third-Party Cleaning
- Taking the Time To Get It Right
Fuzzy Stuff: SQL Server 2005 and Integration Services
SQL Server 2005 includes a feature called Integration Services for transforming data, including cleaning it. SQL Server Integration Services (SSIS) uses a workflow model with a (mostly) drag-and-drop GUI to let you move data easily through the ETL process.
The transforming elements of SSIS are called packages; you transform data by stringing together the appropriate packages. SSIS comes with a number of packages, and others are available from third parties. Your programmers can also write their own packages.
Two especially important packages in SSIS are Fuzzy Lookup and Fuzzy Grouping, which will return the best match for a given record. Essentially, Fuzzy Lookup and Fuzzy Grouping work by breaking each selected field into user-defined sub-fields (such as the street name and address number in an address field), comparing them against a preset list, and judging how similar they are by using a user-defined similarity value. Using fuzzy logic, the packages report best matches as well as how similar the record is to the matched model.
Of course, this capability doesn’t come cheap. Fuzzy Lookup and Fuzzy Grouping have to perform calculations on every selected column in every entry (row). This activity takes computing power—and, as a result, time. Microsoft warns that using these features will take some time to produce results, especially when building the initial list of possibilities.