Data Quality Improvement Practices
Many organizations still sidestep long-term data quality improvement practices in favor of achieving short-term goals. However, an increasing number of organizations realize that failing to address poor data quality can have adverse effects, such as customer attrition or a severe loss of market share. Analyst firms, such as the Gartner Group, have warned of consequences as grave as total business failure.
Data Profiling
The first step in improving data quality is to uncover your data defects through data profiling, sometimes called data archeology, which is the process of analyzing the data for correctness, completeness, uniqueness, consistency, and reasonability. Profiling was once a difficult and tedious task that required dozens of SQL and 4GL/5GL programs to search through every record on every file or database for data anomalies; today, data cleansing tools can profile the data for you.
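Profiling logic of that kind can be sketched in a few lines of procedural code. The example below is a minimal sketch, assuming a source extract has been loaded into a pandas DataFrame; the file and column names are hypothetical.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize completeness, uniqueness, and basic content per column."""
    return pd.DataFrame({
        "non_null": df.notna().sum(),           # completeness
        "null_pct": df.isna().mean().round(3),  # share of missing values
        "unique":   df.nunique(),               # uniqueness / candidate keys
        "zeros":    (df == 0).sum(),            # suspicious default values
    })

# Hypothetical source extract
customers = pd.read_csv("customer_extract.csv")
print(profile(customers))
```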
Similarly, you may be able to leverage some functions of your data mining tool to assess your data quality. For example, Teradata's data mining tool, Warehouse Miner, has two functions that can be used for source data analysis. Its "values analysis" function identifies characteristics of the data values, such as zeros, NULLs, and the number of unique values, whereas its "overlap analysis" function identifies the number of overlapping keys that tables share, which is helpful for data mart consolidation. Histograms and scatter plots allow you to visually detect outliers. In addition, the SQL generated by the tool can be run against the entire database to quickly separate aberrant values from the norm.
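Neither function requires a specialized tool to approximate. The sketch below, using hypothetical file and column names, mimics an overlap analysis between two customer extracts and flags aberrant numeric values with a simple standard-deviation rule.

```python
import pandas as pd

# Hypothetical extracts from two data marts being considered for consolidation
cust_a = pd.read_csv("customer_mart_a.csv")
cust_b = pd.read_csv("customer_mart_b.csv")

# Overlap analysis: how many key values do the two tables share?
keys_a = set(cust_a["customer_id"])
keys_b = set(cust_b["customer_id"])
print(f"overlapping keys: {len(keys_a & keys_b)}")
print(f"only in mart A:   {len(keys_a - keys_b)}")
print(f"only in mart B:   {len(keys_b - keys_a)}")

# Outlier check: flag values more than three standard deviations from the mean
balances = cust_a["account_balance"].dropna()
mean, std = balances.mean(), balances.std()
outliers = balances[(balances - mean).abs() > 3 * std]
print(f"aberrant balances: {len(outliers)}")
```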
Data Cleansing
After the extent of "dirty data" is known, the easiest way to start the data quality improvement process is to cleanse operational data at the time it is moved into data warehouse (DW) databases, where it is used for cross-organizational reporting. However, data cleansing is a labor-intensive, time-consuming, and expensive process, and cleansing all the data is usually neither cost-justified nor practical. On the other hand, cleansing none of the data is equally unacceptable. It is therefore important to carefully analyze the source data and to classify the data elements as critical, important, or insignificant to the business. Then concentrate on cleansing all the critical data elements and, as time permits, as many of the important data elements as practical, leaving the insignificant data elements unchanged. In other words, you do not need to cleanse all the data, and you do not need to do it all at once.
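One way to operationalize that triage is to record the classification next to the cleansing rules and apply only the rules for the elements you have committed to fix. The sketch below assumes pandas and uses hypothetical column names, classifications, and rules.

```python
import pandas as pd

# Hypothetical classification of data elements by business criticality
CLASSIFICATION = {
    "customer_id": "critical",
    "tax_id":      "critical",
    "email":       "important",
    "phone":       "important",
    "fax":         "insignificant",
}

# Hypothetical cleansing rules, keyed by column
RULES = {
    "customer_id": lambda s: s.astype(str).str.strip(),
    "tax_id":      lambda s: s.astype(str).str.replace("-", "", regex=False),
    "email":       lambda s: s.str.strip().str.lower(),
    "phone":       lambda s: s.astype(str).str.replace(r"\D", "", regex=True),
}

def cleanse(df: pd.DataFrame, levels=("critical",)) -> pd.DataFrame:
    """Apply cleansing rules only to columns classified at the given levels."""
    out = df.copy()
    for col, level in CLASSIFICATION.items():
        if level in levels and col in RULES and col in out.columns:
            out[col] = RULES[col](out[col])
    return out

customers = pd.read_csv("customer_extract.csv")
cleansed = cleanse(customers, levels=("critical",))                 # first pass
# cleansed = cleanse(customers, levels=("critical", "important"))   # as time permits
```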
Another factor that will influence your ability to cleanse the data is whether the correct data still exists or whether it can be recreated with a minimal amount of manual or automated effort. There are situations where values are so convoluted or disparate—even with different and opposing meanings to the same fact—that any attempt to decipher such data might produce even worse results. In that case, it might be best to just leave the data alone.
Another decision to make is how to cleanse what can reasonably be cleansed. Can the data cleansing products on the market today handle most of the common data quality problems? The answer is yes. Are the data cleansing and extract/transform/load (ETL) products on the market capable of resolving all of the complicated and unique "dirty data" situations on all of your platforms, and will they ever be? The answer is probably no. Therefore, if you are truly serious about creating value-added information out of the dirty data, then you will probably have to invest in writing some procedural code to supplement the capabilities of your tools.
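As an illustration of the kind of supplemental procedural code this can involve, the sketch below reconciles a customer status that two source systems record with conflicting meanings, using a precedence rule; the file names, column names, and rule are hypothetical.

```python
import pandas as pd

# Hypothetical precedence rule: the billing system is authoritative for status,
# except that a legal hold recorded in the CRM always wins.
def resolve_status(billing_status, crm_status):
    if crm_status == "LEGAL_HOLD":
        return "LEGAL_HOLD"
    if pd.isna(billing_status) or billing_status == "":
        return crm_status
    return billing_status

billing = pd.read_csv("billing_customers.csv")   # hypothetical extracts
crm = pd.read_csv("crm_customers.csv")
merged = billing.merge(crm, on="customer_id", suffixes=("_billing", "_crm"))
merged["status"] = [
    resolve_status(b, c)
    for b, c in zip(merged["status_billing"], merged["status_crm"])
]
```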
Data Defect Prevention
The next decision to make is how to prevent future "dirty data" from being entered. That begins with identifying the root causes of the data defects, which can be any combination of the following:
- Defective program logic
- Not enough program edits
- Not understanding the meaning of a data element
- No common metadata
- No domain definitions
- No reconciliation process
- No data verification process
- Poor data entry training
- Inadequate time for data entry
- No incentive for quality data entry
The owners of the operational systems should plan to improve their programs and edit checks, unless the effort is unreasonably high. For example, if the corrective action requires changing the file structure, which means modifying (if not rewriting) most of the programs that access that file, then the cost of such an invasive corrective action on the operational system is probably not justifiable, especially if the bad data does not interfere with the operational needs of that system. This type of decision cannot, and should not, be made by IT alone. Downstream information consumers must negotiate with the data originators to justify and prioritize the data quality improvement steps.
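As a simple illustration of what a stronger program edit looks like, the hypothetical check below rejects a defective record at entry time instead of letting it reach the downstream databases; the field names and domain values are assumptions.

```python
from datetime import date

# Hypothetical edit checks applied before an order record is accepted
VALID_REGIONS = {"NORTH", "SOUTH", "EAST", "WEST"}

def edit_check(record: dict) -> list:
    """Return a list of edit failures; an empty list means the record passes."""
    errors = []
    if not record.get("customer_id"):
        errors.append("customer_id is required")
    if record.get("region") not in VALID_REGIONS:
        errors.append(f"region '{record.get('region')}' is not in the valid domain")
    if record.get("order_date") and record["order_date"] > date.today():
        errors.append("order_date cannot be in the future")
    return errors

failures = edit_check({"customer_id": "C100", "region": "NORTHEAST",
                       "order_date": date.today()})
if failures:
    print("record rejected:", failures)
```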
A data governance group should be established at the enterprise level, staffed with data administrators, metadata administrators, and data quality stewards:
- Data administrators—These people are responsible for the enterprise logical data model, for establishing and maintaining naming standards, and for capturing data-related business rules.
- Metadata administrators—These people are responsible for loading, linking, managing, and disseminating metadata to facilitate the common understanding of data and to encourage data reuse. Metadata is the contextual information about the data. Metadata components include data names, data definitions, business rules, data content (domains), data type, data length, data owner, data transformations, degree of cleanliness, and so on; a minimal structured record capturing these components is sketched after this list.
- Data quality stewards—These people are charged with preventing the propagation of inferior-quality data throughout the enterprise and, in turn, into the decision-making processes. Therefore, it is their responsibility to perform regular data audits on business data, metadata, and data models, and to be involved in data reconciliation efforts by helping to identify and resolve the root causes of data quality issues. The findings of these audits and reconciliation efforts should feed back into a continuous data quality improvement cycle.
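One minimal way to make those metadata components concrete is to capture them in a structured record, as in the hypothetical sketch below; the element, domain, and owner values are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataEntry:
    """Hypothetical metadata record for a single data element."""
    data_name: str
    definition: str
    business_rules: list = field(default_factory=list)
    domain: list = field(default_factory=list)   # allowed content / values
    data_type: str = "CHAR"
    length: int = 0
    owner: str = ""
    transformations: list = field(default_factory=list)
    cleanliness_pct: float = 0.0                 # degree of cleanliness

account_status = MetadataEntry(
    data_name="Account Status Code",
    definition="Current lifecycle state of a customer account.",
    business_rules=["A closed account must have an Account Closed Date."],
    domain=["OPEN", "CLOSED", "LEGAL_HOLD"],
    data_type="CHAR",
    length=10,
    owner="Customer Accounting",
    cleanliness_pct=98.5,
)
```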
Data quality training should be instituted to address poor data entry habits. Not all data rules can be enforced through edit checks or through the features of relational databases, such as strong data typing, referential integrity, look-up tables, and stored edit procedures. Many data violations can still occur because of human error, negligence, or intentionally introduced errors. For example, if an end user needs a new data element but must wait six months for IT to change the database, the end user might simply decide to overload an existing column and use it for dual (or triple) purposes, such as putting the date of the last promotion into the Account Closed Date column.
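A data quality steward can catch that kind of column overloading with a periodic audit check. The sketch below, with hypothetical file and column names, flags open accounts that carry a value in the Account Closed Date column.

```python
import pandas as pd

accounts = pd.read_csv("account_extract.csv")  # hypothetical extract

# An open account should not have a closed date; a populated value suggests
# the column is being overloaded for another purpose (e.g., last promotion date).
suspect = accounts[
    (accounts["account_status"] == "OPEN") & accounts["account_closed_date"].notna()
]
print(f"{len(suspect)} open accounts with a value in Account Closed Date")
```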