- The Business Case for a New Design Process
- Improving the Development Process
- Overview of Data Integration Modeling
- Conceptual Data Integration Models
- Logical Data Integration Models
- Physical Data Integration Models
- Tools for Developing Data Integration Models
- Industry-Based Data Integration Models
- Summary
- End-of-Chapter Questions
Physical Data Integration Models
The purpose of a physical data integration model is to produce a detailed representation of the data integration specifications at the component level within the targeted data integration technology.
A major concern in physical data integration modeling is determining how best to take the logical design and apply design techniques that will optimize performance.
Converting Logical Data Integration Models to Physical Data Integration Models
Just as data modeling transitions from logical to physical data models, the same transition occurs in data integration modeling. Logical data integration modeling determines what extracts, data quality checks, transformations, and loads are needed. Physical data integration modeling leverages a target-based design technique, which provides guidelines on designing the "hows" so that the various components will perform optimally in the target data integration environment.
Target-Based Data Integration Design Technique Overview
The target-based data integration design technique is an approach that creates physical data integration components based on the subject area loads and the source systems that populate those subject areas. It groups logical functionality into reusable components based on the data movement patterns of local versus enterprise usage within each data integration model type.
For example, in most data integration processes, there are source system-level and enterprise-level data quality checks. The target-based technique places that functionality either close to the process that will use it (in this case, the extract process) or groups enterprise capabilities in common component models.
Specifically, the target-based technique moves source system-specific data quality checks into the extract processes and local transformations into the load processes, while grouping enterprise-level data quality and transformations at the common component level. This is displayed in Figure 3.11.
Figure 3.11 Distributing logical functionality between the "whats" and "hows"
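As a rough illustration of this distribution (a minimal Python sketch; the function names and field-level rules are hypothetical, not taken from the text), source system-specific checks live with the extract process, while enterprise-level business rules sit in a common component that every extract reuses:

```python
# Hypothetical sketch of distributing data quality logic.
# Source-system-specific checks stay with the extract process;
# enterprise-level checks live in one reusable common component.

def source_system_checks(record: dict) -> list[str]:
    """Technical checks that only make sense for this particular source system."""
    errors = []
    if not record.get("CUST_NO"):                 # mandatory key on this feed
        errors.append("missing CUST_NO")
    if len(record.get("BRANCH_CD", "")) != 4:     # source-specific format rule
        errors.append("BRANCH_CD must be 4 characters")
    return errors

def enterprise_checks(record: dict) -> list[str]:
    """Business rules that apply to every source feeding the target environment."""
    errors = []
    if record.get("GENDER") not in {"M", "F", "U"}:
        errors.append("invalid gender code")
    if not record.get("POSTAL_CODE", "").strip():
        errors.append("missing postal code")
    return errors

# The extract process runs the local checks; the enterprise rules are
# deferred to the common component model that all extracts share.
record = {"CUST_NO": "1001", "BRANCH_CD": "023", "GENDER": "M", "POSTAL_CODE": "10020"}
print(source_system_checks(record))   # ['BRANCH_CD must be 4 characters']
print(enterprise_checks(record))      # []
```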
The target-based data integration design technique is not a new concept: Coupling and cohesion, modularity, objects, and components are all techniques used to group "stuff" into understandable and highly functional units of work. The target-based technique is simply a method of modularizing core functionality within the data integration models.
Physical Source System Data Integration Models
A source system extract data integration model extracts the data from a source system, performs source system data quality checks, and then conforms that data into the specific subject area file formats, as shown in Figure 3.12.
Figure 3.12 Physical source system extract data integration model example
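As a rough sketch of the "conform" step (hypothetical field names, assuming each subject area has an agreed file layout), the extract remaps source columns into the subject area file format before handing the data downstream:

```python
# Hypothetical mapping from one source system layout to a
# customer subject area file format.
SOURCE_TO_SUBJECT_AREA = {
    "CUST_NO": "CUSTOMER_NUMBER",
    "CUST_NM": "CUSTOMER_NAME",
    "OPEN_DT": "ACCOUNT_OPEN_DATE",
}

def conform_to_subject_area(source_row: dict) -> dict:
    """Rename source fields into the agreed subject area file format."""
    return {target: source_row.get(source, "")
            for source, target in SOURCE_TO_SUBJECT_AREA.items()}

print(conform_to_subject_area(
    {"CUST_NO": "1001", "CUST_NM": "Ada Lovelace", "OPEN_DT": "2009-06-30"}))
# {'CUSTOMER_NUMBER': '1001', 'CUSTOMER_NAME': 'Ada Lovelace', 'ACCOUNT_OPEN_DATE': '2009-06-30'}
```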
The major difference between a logical extract model and a physical source system data integration model is the focus on the final design considerations needed to extract data from the specified source system.
Designing an Extract Verification Process
The data from the source system files is extracted and verified against a control file. A control file check is a data quality check that verifies the number of rows of data and a control total (for example, loan amounts summed for a specific source extract).
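A minimal sketch of such a verification step, assuming a hypothetical CSV layout in which the control file carries an expected row count and a loan-amount control total:

```python
import csv
from decimal import Decimal

def verify_extract(data_path: str, control_path: str) -> bool:
    """Compare an extracted file against its control file: row count
    plus a control total (here, the sum of loan amounts)."""
    row_count = 0
    loan_total = Decimal("0")
    with open(data_path, newline="") as f:
        for row in csv.DictReader(f):
            row_count += 1
            loan_total += Decimal(row["LOAN_AMOUNT"])

    with open(control_path, newline="") as f:
        control = next(csv.DictReader(f))
    expected_rows = int(control["ROW_COUNT"])
    expected_total = Decimal(control["LOAN_AMOUNT_TOTAL"])

    if row_count != expected_rows:
        print(f"Row count mismatch: {row_count} vs {expected_rows}")
        return False
    if loan_total != expected_total:
        print(f"Control total mismatch: {loan_total} vs {expected_total}")
        return False
    return True
```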
It is here that source system-specific data quality rules are applied. The rationale for applying them at the particular source system rather than in one overall data quality job is to facilitate maintenance and performance: one giant data quality job becomes a maintenance nightmare, and it requires an unnecessary amount of system memory to load all of the data quality processes and variables, which slows overall job processing.
Cross-system dependencies should be processed in this model. For example, associative relationships for connecting agreements together should be processed here.
Physical Common Component Data Integration Models
The physical common component data integration model contains the enterprise-level business data quality rules and common transformations that will be leveraged by multiple data integration applications. This layer of the architecture is a critical focal point for reusability in the overall data integration process flow, with particular emphasis on leveraging existing transformation components. Any new components must meet the criteria for reusability.
Finally, in designing common component data integration models, the process flow is examined to determine where parallelism can be built into the design, based on expected data volumes and within the constraints of the current data integration technology.
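One way to picture that design decision is the hypothetical Python sketch below, which partitions the input and runs a common transformation over the partitions in parallel; a real data integration tool would typically handle partitioning and parallelism natively, and the names here are illustrative only.

```python
from concurrent.futures import ProcessPoolExecutor

def conform_customer_name(records: list[dict]) -> list[dict]:
    """A common, enterprise-level transformation applied to one partition."""
    return [{**r, "CUSTOMER_NAME": r["CUSTOMER_NAME"].strip().upper()} for r in records]

def run_partitioned(records: list[dict], partitions: int = 4) -> list[dict]:
    """Split the input into partitions and transform them in parallel.
    The degree of parallelism would be tuned to expected data volumes
    and to what the data integration technology actually supports."""
    chunks = [records[i::partitions] for i in range(partitions)]
    conformed: list[dict] = []
    with ProcessPoolExecutor(max_workers=partitions) as pool:
        for result in pool.map(conform_customer_name, chunks):
            conformed.extend(result)   # note: the round-robin split reorders rows
    return conformed

if __name__ == "__main__":
    rows = [{"CUSTOMER_NAME": f" customer {i} "} for i in range(8)]
    print(len(run_partitioned(rows)))   # 8
```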
Common Component Data Quality Data Integration Models
Common component data quality data integration models are generally very "thin" (less functionality) process models that contain the enterprise-level data quality rules. Generally, source system-specific data quality rules are technical in nature, whereas business data quality rules tend to be applied at the enterprise level.
For example, checks on gender codes or postal codes are business rules that can be applied as data quality rules against all data being processed. Figure 3.13 illustrates an example of a common data quality data integration model.
Figure 3.13 Common components—data quality data integration model example
Note that the source-specific data quality rules have been moved to the physical source system extract data integration model, leaving a thinner data quality process at the common component level. This thinner process ensures that the data flow is not unnecessarily constrained and that overall processing performance improves.
Common Component Transformation Data Integration Models
Most common transforms are those that conform data to an enterprise data model. Transformations needed for specific aggregations and calculations are moved to the subject area loads where they are needed, that is, into the subject areas for which the data is being transformed.
In terms of enterprise-level aggregations and calculations, there are usually very few; most transformations are subject-area-specific. An example of a common component-transformation data integration subject area model is depicted in Figure 3.14.
Figure 3.14 Common components—transform data integration model example
Note that the aggregations for the demand deposit layer have been removed from the common component model and moved to the subject area load, in line with the concept of moving functionality to where it is needed.
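As a hypothetical illustration of that placement, an aggregation such as a demand deposit balance rollup would be coded inside the subject area load process itself rather than in the shared transform component (the field names below are illustrative, not from the text):

```python
from collections import defaultdict

def aggregate_demand_deposit(transactions: list[dict]) -> list[dict]:
    """Subject-area-specific aggregation performed inside the demand deposit
    subject area load, not in the common component layer."""
    totals: dict[str, float] = defaultdict(float)
    for txn in transactions:
        totals[txn["ACCOUNT_NO"]] += txn["AMOUNT"]
    return [{"ACCOUNT_NO": acct, "BALANCE": amt} for acct, amt in sorted(totals.items())]

txns = [
    {"ACCOUNT_NO": "DD-001", "AMOUNT": 250.00},
    {"ACCOUNT_NO": "DD-001", "AMOUNT": -75.00},
    {"ACCOUNT_NO": "DD-002", "AMOUNT": 1200.00},
]
print(aggregate_demand_deposit(txns))
# [{'ACCOUNT_NO': 'DD-001', 'BALANCE': 175.0}, {'ACCOUNT_NO': 'DD-002', 'BALANCE': 1200.0}]
```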
Physical Subject Area Load Data Integration Models
A subject area load data integration model logically groups "target tables" together based on subject area (grouping of targets) dependencies and serves as a simplification for source system processing (layer of indirection).
A subject area load data integration model performs the following functions:
- Loads data
- Refreshes snapshot loads
- Performs Change Data Capture
It is in the subject area load data integration models where primary and foreign keys will be generated, referential integrity is confirmed, and Change Data Capture is processed.
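A toy sketch of those three responsibilities follows (hypothetical record layouts; a production load would rely on the database and the data integration tool for surrogate key generation and Change Data Capture):

```python
def load_customer_subject_area(incoming: list[dict], existing: dict[str, dict],
                               valid_branch_keys: set[int], next_key: int):
    """Simplified subject area load: confirm referential integrity, generate
    surrogate keys for new rows, and detect changed rows (basic Change Data Capture)."""
    inserts, updates = [], []
    for row in incoming:
        if row["BRANCH_KEY"] not in valid_branch_keys:   # referential integrity check
            continue                                      # in practice, route to a reject file
        current = existing.get(row["CUST_NO"])
        if current is None:                               # new record: assign a surrogate key
            inserts.append({**row, "CUST_KEY": next_key})
            next_key += 1
        elif current["NAME"] != row["NAME"]:              # changed record: reuse the existing key
            updates.append({**row, "CUST_KEY": current["CUST_KEY"]})
    return inserts, updates, next_key

existing = {"1001": {"CUST_KEY": 1, "NAME": "ADA LOVELACE"}}
incoming = [
    {"CUST_NO": "1001", "NAME": "ADA KING", "BRANCH_KEY": 10},
    {"CUST_NO": "2002", "NAME": "ALAN TURING", "BRANCH_KEY": 10},
]
print(load_customer_subject_area(incoming, existing, {10}, next_key=2))
# one insert (2002), one update (1001), next_key == 3
```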
In addition to making the processing easier to understand and maintain, grouping data by subject area logically limits the amount of data carried per process; it is important to carry as little data as possible through these processes to minimize performance issues. An example of a physical subject area load data integration model is shown in Figure 3.15.
Figure 3.15 Physical subject area load data integration model example
Logical Versus Physical Data Integration Models
One question that always arises in these efforts is, "Is there a need to have one set of logical data integration models and another set of physical data integration models?"
The answer for data integration models is the same as for data models: "It depends." It depends on the maturity of the data management organization that will create, manage, and own the models, particularly in terms of how it manages metadata, and it depends on the other data management artifacts in place (such as logical and physical data models).