HAPPY BOOKSGIVING
Use code BOOKSGIVING during checkout to save 40%-55% on books and eBooks. Shop now.
Register your product to gain access to bonus material or receive a coupon.
Making Data Integration Work: How to Systematically Reduce Cost, Improve Quality, and Enhance Effectiveness
Today’s enterprises are investing massive resources in data integration. Many possess thousands of point-to-point data integration applications that are costly, undocumented, and difficult to maintain. Data integration now accounts for a major part of the expense and risk of typical data warehousing and business intelligence projects--and, as businesses increasingly rely on analytics, the need for a blueprint for data integration is increasing now more than ever.
This book presents the solution: a clear, consistent approach to defining, designing, and building data integration components to reduce cost, simplify management, enhance quality, and improve effectiveness. Leading IBM data management expert Tony Giordano brings together best practices for architecture, design, and methodology, and shows how to do the disciplined work of getting data integration right.
Mr. Giordano begins with an overview of the “patterns” of data integration, showing how to build blueprints that smoothly handle both operational and analytic data integration. Next, he walks through the entire project lifecycle, explaining each phase, activity, task, and deliverable through a complete case study. Finally, he shows how to integrate data integration with other information management disciplines, from data governance to metadata. The book’s appendices bring together key principles, detailed models, and a complete data integration glossary.
Coverage includes
A Design Technique: Data Integration Modeling
Download the sample pages (includes Chapter 3 and Index)
Preface xix
Acknowledgments xxii
About the Author xxiii
Introduction: Why Is Data Integration Important? 1
Part 1 Overview of Data Integration 5
Chapter 1 Types of Data Integration 7
Data Integration Architectural Patterns 7
Enterprise Application Integration (EAI) 8
Service-Oriented Architecture (SOA) 9
Federation 12
Extract, Transform, Load (ETL) 14
Common Data Integration Functionality 15
Summary 16
End-of-Chapter Questions 16
Chapter 2 An Architecture for Data Integration 19
What Is Reference Architecture? 19
Reference Architecture for Data Integration 20
Objectives of the Data Integration Reference Architecture 21
The Data Subject Area-Based Component Design Approach 22
A Scalable Architecture 24
Purposes of the Data Integration Reference Architecture 26
The Layers of the Data Integration Architecture 26
Extract/Subscribe Processes 27
Data Integration Guiding Principle: “Read Once, Write Many” 28
Data Integration Guiding Principle: “Grab Everything” 28
Initial Staging Landing Zone 29
Data Quality Processes 31
What Is Data Quality? 31
Causes of Poor Data Quality 31
Data Quality Check Points 32
Where to Perform a Data Quality Check 32
Clean Staging Landing Zone 34
Transform Processes 35
Conforming Transform Types 35
Calculations and Splits Transform Types 35
Processing and Enrichment Transform Types 36
Target Filters Transform Types 38
Load-Ready Publish Landing Zone 39
Load/Publish Processes 40
Physical Load Architectures 41
An Overall Data Architecture 41
Summary 42
End-of-Chapter Questions 43
Chapter 3 A Design Technique: Data Integration Modeling 45
The Business Case for a New Design Process 45
Improving the Development Process 47
Leveraging Process Modeling for Data Integration 48
Overview of Data Integration Modeling 48
Modeling to the Data Integration Architecture 48
Data Integration Models within the SDLC 49
Structuring Models on the Reference Architecture 50
Conceptual Data Integration Models 51
Logical Data Integration Models 51
High-Level Logical Data Integration Model 52
Logical Extraction Data Integration Models 52
Logical Data Quality Data Integration Models 53
Logical Transform Data Integration Models 54
Logical Load Data Integration Models 55
Physical Data Integration Models 56
Converting Logical Data Integration Models to Physical Data Integration Models 56
Target-Based Data Integration Design Technique Overview 56
Physical Source System Data Integration Models 57
Physical Common Component Data Integration Models 58
Physical Subject Area Load Data Integration Models 60
Logical Versus Physical Data Integration Models 61
Tools for Developing Data Integration Models 61
Industry-Based Data Integration Models 63
Summary 64
End-of-Chapter Questions 65
Chapter 4 Case Study: Customer Loan Data Warehouse Project 67
Case Study Overview 67
Step 1: Build a Conceptual Data Integration Model 69
Step 2: Build a High-Level Logical Model Data Integration Model 70
Step 3: Build the Logical Extract DI Models 72
Confirm the Subject Area Focus from the Data Mapping Document 73
Review Whether the Existing Data Integration Environment Can Fulfill the Requirements 74
Determine the Business Extraction Rules 74
Control File Check Processing 74
Complete the Logical Extract Data Integration Models 74
Final Thoughts on Designing a Logical Extract DI Model 76
Step 4: Define a Logical Data Quality DI Model 76
Design a Logical Data Quality Data Integration Model 77
Identify Technical and Business Data Quality Criteria 77
Determine Absolute and Optional Data Quality Criteria 80
Step 5: Define the Logical Transform DI Model 81
Step 6: Define the Logical Load DI Model 85
Step 7: Determine the Physicalization Strategy 87
Step 8: Convert the Logical Extract Models into Physical Source System Extract DI Models 88
Step 9: Refine the Logical Load Models into Physical Source System Subject Area Load DI Models 90
Step 10: Package the Enterprise Business Rules into Common Component Models 92
Step 11: Sequence the Physical DI Models 94
Summary 95
Part 2 The Data Integration Systems Development Life Cycle 97
Chapter 5 Data Integration Analysis 99
Analyzing Data Integration Requirements 100
Building a Conceptual Data Integration Model 101
Key Conceptual Data Integration Modeling Task Steps 102
Why Is Source System Data Discovery So Difficult? 103
Performing Source System Data Profiling 104
Overview of Data Profiling 104
Key Source System Data Profiling Task Steps 105
Reviewing/Assessing Source Data Quality 109
Validation Checks to Assess the Data 109
Key Review/Assess Source Data Quality Task Steps 111
Performing Source\Target Data Mappings 111
Overview of Data Mapping 112
Types of Data Mapping 113
Key Source\Target Data Mapping Task Steps 115
Summary 116
End-of-Chapter Questions 116
Chapter 6 Data Integration Analysis Case Study 117
Case Study Overview 117
Envisioned Wheeler Data Warehouse Environment 118
Aggregations in a Data Warehouse Environment 120
Data Integration Analysis Phase 123
Step 1: Build a Conceptual Data Integration Model 123
Step 2: Perform Source System Data Profiling 124
Step 3: Review/Assess Source Data Quality 130
Step 4: Perform Source\Target Data Mappings 135
Summary 145
Chapter 7 Data Integration Logical Design 147
Determining High-Level Data Volumetrics 147
Extract Sizing 148
Disk Space Sizing 148
File Size Impacts Component Design 150
Key Data Integration Volumetrics Task Steps 150
Establishing a Data Integration Architecture 151
Identifying Data Quality Criteria 154
Examples of Data Quality Criteria from a Target 155
Key Data Quality Criteria Identification Task Steps 155
Creating Logical Data Integration Models 156
Key Logical Data Integration Model Task Steps 157
Defining One-Time Data Conversion Load Logical Design 163
Designing a History Conversion 164
One-Time History Data Conversion Task Steps 166
Summary 166
End-of-Chapter Questions 167
Chapter 8 Data Integration Logical Design Case Study 169
Step 1: Determine High-Level Data Volumetrics 169
Step 2: Establish the Data Integration Architecture 174
Step 3: Identify Data Quality Criteria 177
Step 4: Create Logical Data Integration Models 180
Define the High-Level Logical Data Integration Model 181
Define the Logical Extraction Data Integration Model 183
Define the Logical Data Quality Data Integration Model 187
Define Logical Transform Data Integration Model 190
Define Logical Load Data Integration Model 191
Define Logical Data Mart Data Integration Model 192
Develop the History Conversion Design 195
Summary 198
Chapter 9 Data Integration Physical Design 199
Creating Component-Based Physical Designs 200
Reviewing the Rationale for a Component-Based Design 200
Modularity Design Principles 200
Key Component-Based Physical Designs Creation Task Steps 201
Preparing the DI Development Environment 201
Key Data Integration Development Environment Preparation Task Steps 202
Creating Physical Data Integration Models 203
Point-to-Point Application Development--The Evolution of Data Integration Development 203
The High-Level Logical Data Integration Model in Physical Design 205
Design Physical Common Components Data Integration Models 206
Design Physical Source System Extract Data Integration Models 208
Design Physical Subject Area Load Data Integration Models 209
Designing Parallelism into the Data Integration Models 210
Types of Data Integration Parallel Processing 211
Other Parallel Processing Design Considerations 214
Parallel Processing Pitfalls 215
Key Parallelism Design Task Steps 216
Designing Change Data Capture 216
Append Change Data Capture Design Complexities 217
Key Change Data Capture Design Task Steps 219
Finalizing the History Conversion Design 220
From Hypothesis to Fact 220
Finalize History Data Conversion Design Task Steps 220
Defining Data Integration Operational Requirements 221
Determining a Job Schedule for the Data Integration Jobs 221
Determining a Production Support Team 222
Key Data Integration Operational Requirements Task Steps 224
Designing Data Integration Components for SOA 225
Leveraging Traditional Data Integration Processes as SOA Services 225
Appropriate Data Integration Job Types 227
Key Data Integration Design for SOA Task Steps 227
Summary 228
End-of-Chapter Questions 228
Chapter 10 Data Integration Physical Design Case Study 229
Step 1: Create Physical Data Integration Models 229
Instantiating the Logical Data Integration Models into a Data Integration Package 229
Step 2: Find Opportunities to Tune through Parallel Processing 237
Step 3: Complete Wheeler History Conversion Design 238
Step 4: Define Data Integration Operational Requirements 239
Developing a Job Schedule for Wheeler 240
The Wheeler Monthly Job Schedule 240
The Wheeler Monthly Job Flow 240
Process Step 1: Preparation for the EDW Load Processing 241
Process Step 2: Source System to Subject Area File Processing 242
Process Step 3: Subject Area Files to EDW Load Processing 245
Process Step 4: EDW-to-Product Line Profitability Data Mart Load Processing 248
Production Support Staffing 248
Summary 249
Chapter 11 Data Integration Development Cycle 251
Performing General Data Integration Development Activities 253
Data Integration Development Standards 253
Error-Handling Requirements 255
Naming Standards 255
Key General Development Task Steps 256
Prototyping a Set of Data Integration Functionality 257
The Rationale for Prototyping 257
Benefits of Prototyping 257
Prototyping Example 258
Key Data Integration Prototyping Task Steps 261
Completing/Extending Data Integration Job Code 262
Complete/Extend Common Component Data Integration Jobs 263
Complete/Extend the Source System Extract Data Integration Jobs 264
Complete/Extend the Subject Area Load Data Integration Jobs 265
Performing Data Integration Testing 266
Data Warehousing Testing Overview 267
Types of Data Warehousing Testing 268
Perform Data Warehouse Unit Testing 269
Perform Data Warehouse Integration Testing 272
Perform Data Warehouse System and Performance Testing 273
Perform Data Warehouse User Acceptance Testing 274
The Role of Configuration Management in Data Integration 275
What Is Configuration Management? 276
Data Integration Version Control 277
Data Integration Software Promotion Life Cycle 277
Summary 277
End-of-Chapter Questions 278
Chapter 12 Data Integration Development Cycle Case Study 279
Step 1: Prototype the Common Customer Key 279
Step 2: Develop User Test Cases 283
Domestic OM Source System Extract Job Unit Test Case 284
Summary 287
Part 3 Data Integration with Other Information Management Disciplines 289
Chapter 13 Data Integration and Data Governance 291
What Is Data Governance? 292
Why Is Data Governance Important? 294
Components of Data Governance 295
Foundational Data Governance Processes 295
Data Governance Organizational Structure 298
Data Stewardship Processes 304
Data Governance Functions in Data Warehousing 305
Compliance in Data Governance 309
Data Governance Change Management 310
Summary 311
End-of-Chapter Questions 311
Chapter 14 Metadata 313
What Is Metadata? 313
The Role of Metadata in Data Integration 314
Categories of Metadata 314
Business Metadata 315
Structural Metadata 315
Navigational Metadata 317
Analytic Metadata 318
Operational Metadata 319
Metadata as Part of a Reference Architecture 319
Metadata Users 320
Managing Metadata 321
The Importance of Metadata Management in Data Governance 321
Metadata Environment Current State 322
Metadata Management Plan 322
Metadata Management Life Cycle 324
Summary 327
End-of-Chapter Questions 327
Chapter 15 Data Quality 329
The Data Quality Framework 330
Key Data Quality Elements 331
The Technical Data Quality Dimension 332
The Business-Process Data Quality Dimension 333
Types of Data Quality Processes 334
The Data Quality Life Cycle 334
The Define Phase 336
Defining the Data Quality Scope 336
Identifying/Defining the Data Quality Elements 336
Developing Preventive Data Quality Processes 337
The Audit Phase 345
Developing a Data Quality Measurement Process 346
Developing Data Quality Reports 348
Auditing Data Quality by LOB or Subject Area 350
The Renovate Phase 351
Data Quality Assessment and Remediation Projects 352
Data Quality SWAT Renovation Projects 352
Data Quality Programs 353
Final Thoughts on Data Quality 353
Summary 353
End-of-Chapter Questions 354
Appendix A Exercise Answers 355
Appendix B Data Integration Guiding Principles 369
Write Once, Read Many 369
Grab Everything 369
Data Quality before Transforms 369
Transformation Componentization 370
Where to Perform Aggregations and Calculations 370
Data Integration Environment Volumetric Sizing 370
Subject Area Volumetric Sizing 370
Appendix C Glossary 371
Appendix D Case Study Models
Appendix D is an online-only appendix. Print-book readers can download the appendix at www.ibmpressbooks.com/title/9780137084937. For eBook editions, the appendix is included in the book.
Index 375