A Design Technique: Data Integration Modeling
- The Business Case for a New Design Process
- Improving the Development Process
- Overview of Data Integration Modeling
- Conceptual Data Integration Models
- Logical Data Integration Models
- Physical Data Integration Models
- Tools for Developing Data Integration Models
- Industry-Based Data Integration Models
- Summary
- End-of-Chapter Questions
The Business Case for a New Design Process
There is a hypothesis about the cause of the massive duplication of data integration processes, which is as follows:
One of the main reasons why data integration processes are massively replicated in many organizations is that there is no visual method of "seeing" what data integration processes currently exist and what is needed. This is similar to the problem that once plagued the data modeling discipline.
In the early 1980s, many organizations had massive duplication of customer and transactional data. These organizations could not see the "full picture" of their data environment and its massive duplication. Once organizations began to document and leverage entity-relationship diagrams (visual representations of a data model), they were able to see the duplication, and the reuse of existing tables increased as unnecessary duplication decreased.
The development of data integration processes is similar to database development. In developing a database, a blueprint, or model, of the business requirements is necessary to ensure that there is a clear understanding between parties of what is needed. In the case of data integration, the data integration designer and the data integration developer need that blueprint or project artifact to ensure that the business requirements for the sources, transformations, and targets needed to move data have been clearly communicated via a common, consistent approach. A process model specifically designed for data integration accomplishes that requirement.
Figure 3.1 depicts the types of data models needed in a project and how they are similar to those that could be developed for data integration.
Figure 3.1 Modeling paradigm: data and data integration
The usual approach for analyzing, designing, and building ETL or data integration processes on most projects involves a data analyst documenting the requirements for source-to-target mapping in Microsoft® Excel® spreadsheets. These spreadsheets are given to an ETL developer for the design and development of maps, graphs, and/or source code.
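To make that artifact concrete, the sketch below shows, in Python, the kind of information a single row of a source-to-target mapping spreadsheet typically captures. The structure name (SourceToTargetMapping), the field names, and the CRM/CUST_MASTER sample values are illustrative assumptions, not a prescribed layout.

```python
# A minimal, illustrative sketch of one source-to-target mapping "row."
# Field names and sample values are assumptions; real spreadsheets vary
# by organization and project.
from dataclasses import dataclass

@dataclass
class SourceToTargetMapping:
    source_system: str      # where the data originates
    source_table: str
    source_column: str
    source_data_type: str   # e.g., "INT", "VARCHAR(30)"
    transform_rule: str     # business or technical transformation logic
    target_table: str
    target_column: str
    target_data_type: str

# One example row, as a data analyst might document it before
# handing it to an ETL developer.
customer_id_mapping = SourceToTargetMapping(
    source_system="CRM",
    source_table="CUST_MASTER",
    source_column="CUST_ID",
    source_data_type="INT",
    transform_rule="Pass through; trim leading zeros",
    target_table="DIM_CUSTOMER",
    target_column="CUSTOMER_ID",
    target_data_type="INT",
)
```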
Documenting integration requirements from source systems and targets manually in a tool such as Excel and then remapping them into an ETL or data integration package has proven to be time-consuming and error-prone. For example:
- Lost time—It takes a considerable amount of time to copy source metadata from source systems into an Excel spreadsheet. The same source information must then be rekeyed into an ETL tool. This source and target metadata captured in Excel is largely nonreusable unless a highly manual review and maintenance process is instituted.
- Nonvalue add analysis—Source-to-target mappings and their transformation requirements contain valuable navigational metadata that can be used for data lineage analysis. Capturing this information in an Excel spreadsheet provides no clean, automated method of harvesting it for that purpose.
- Mapping errors—Despite our best efforts, manual data entry often results in incorrect entries; for example, incorrectly documenting an INT data type as a VARCHAR in an Excel spreadsheet will cost a data integration designer time to analyze and correct (see the sketch following this list).
- Lack of standardization: inconsistent levels of detail—The data analysts who perform the source-to-target mappings tend to capture source/transform/target requirements at different levels of completeness, depending on the skill and experience of the analyst. When the level of detail in the requirements and design of data integration processes is inconsistent, the development staff can misinterpret the source-to-target mapping documents (usually Excel), which often results in coding errors and lost time.
- Lack of standardization: inconsistent file formats—Most environments have multiple extracts in different file formats. The focus and direction must be toward the concept of "read once, write many," with consistency in extract, data quality, transformation, and load formats. The lack of a standardized set of extracts reflects both a lack of technique and, often, a lack of visualization of what is in the environment.
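These issues are compounded because an Excel workbook is not reliably machine-readable. As a hypothetical illustration only, not a substitute for the modeling technique this chapter introduces, the sketch below shows how mapping metadata held in a structured form could be checked automatically, catching the kind of INT-documented-as-VARCHAR error described under "Mapping errors." The names documented_mappings, actual_source_catalog, find_type_mismatches, and the sample tables and columns are assumptions for the example.

```python
# Hypothetical sketch: validating documented source data types against the
# actual source system catalog. All names and sample values are assumptions.

documented_mappings = [
    {"source_table": "CUST_MASTER", "source_column": "CUST_ID",
     "documented_type": "VARCHAR(30)"},   # analyst's (incorrect) entry
    {"source_table": "CUST_MASTER", "source_column": "CUST_NAME",
     "documented_type": "VARCHAR(100)"},
]

# Metadata as it might be read directly from the source system's catalog.
actual_source_catalog = {
    ("CUST_MASTER", "CUST_ID"): "INT",
    ("CUST_MASTER", "CUST_NAME"): "VARCHAR(100)",
}

def find_type_mismatches(mappings, catalog):
    """Return mappings whose documented type disagrees with the catalog."""
    mismatches = []
    for m in mappings:
        actual = catalog.get((m["source_table"], m["source_column"]))
        if actual is not None and actual != m["documented_type"]:
            mismatches.append((m, actual))
    return mismatches

for mapping, actual in find_type_mismatches(documented_mappings,
                                            actual_source_catalog):
    print(f"{mapping['source_table']}.{mapping['source_column']}: "
          f"documented as {mapping['documented_type']}, catalog says {actual}")
```

The point of the sketch is simply that structured, reusable metadata lends itself to this kind of automated cross-checking, whereas metadata rekeyed between spreadsheets and an ETL tool does not.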
To improve the design and development efficiencies of data integration processes, in terms of time, consistency, quality, and reusability, a graphical process modeling design technique for data integration is needed, one that applies the same rigor used in developing data models.