The Art of Enterprise Information Architecture: A Conceptual and Logical View
In this chapter, we introduce the EIA Reference Architecture on the Conceptual Layer as well as on the first Logical Layer. Both represent well-defined layers describing the EIA Reference Architecture as outlined in Chapter 2, beginning from the top. The conceptual and logical layers are the first elements of a solution design, with the latter fleshing out the former.
For the Conceptual Layer, we first outline the necessary capabilities for the EIA Reference Architecture in the context of the architecture terms and the Enterprise Information Model previously introduced. An Architecture Overview Diagram (AOD) shows the various required capabilities in a consistent, conceptual overview for the EIA Reference Architecture.
By introducing architecture principles, we guide the further design of the EIA Reference Architecture. We apply them to drill down, in a first step, from the Conceptual to the Logical Layer. We show the Logical View as a first graphical representation of the logical architecture and explain key Architecture Building Blocks (ABBs). The architecture principles introduced in this chapter guide the design in subsequent chapters as well.
4.1 Conceptual Architecture Overview
An EIA provides an information-centric view of the overall Enterprise Architecture. Thus, any instantiation of the EIA Reference Architecture enables an enterprise to create, maintain, use, and govern all information assets throughout their lifecycle from a bottom-up perspective. From a top-down perspective, business users and technical users articulate their information needs in the context of business processes, shaping the business and application architecture based on the role they perform. We develop the EIA Reference Architecture from a top-down perspective: we look at information from the perspective of an end user working with or operating on it to achieve certain goals. Key functional and technical capabilities provide and enable the set of operations on information required by the user community of an enterprise. Thus, we approach the Conceptual View of the EIA Reference Architecture, presented in an AOD, by introducing the required functional and technical capabilities from a business perspective.
In Chapter 3, we introduced the five data domains as part of the Enterprise Information Model. Not surprisingly, the EIA must cover all five data domains with appropriate capabilities as required by each individual domain.
Furthermore, several required capabilities span either some or all data domains (for example, EII). Higher-level capabilities such as Business Performance Management (BPM) are based on foundational information capabilities for the five data domains. New delivery models such as Cloud Computing demand additional capabilities as well. Thus, for building an EIA Reference Architecture that satisfies today's more advanced business requirements as well as upcoming ones, we see the need for the following additional capabilities:
- Predictive Analytics and Real Time Analytics
- Business Performance Management
- Enterprise Information Integration (EII)
- Mashup
- Information Governance
- Information Security and Information Privacy
- Cloud Computing
We explain these capabilities from a high-level perspective1 in the following sub-sections and afterwards show them jointly in a conceptual AOD. A company is not required to implement all capabilities introduced in this section as part of an instantiation of the EIA Reference Architecture. IT Architects design the specific IT solutions based on a careful analysis of the specific requirements throughout the design process.2
We start with the Metadata management capability and continue aligned with the order of the data domains as shown in the Information Reference Model in Chapter 3.
4.1.1 Metadata Management Capability
The Metadata management capability addresses the following business requirements:
- This capability helps to establish an enterprise business glossary in which business terms are correlated with their technical counterparts; it also enables and facilitates efficient communication between business and IT people.
- Metadata management supports Information Governance, which is key to treating information as a strategic asset.
- Metadata is a prerequisite for establishing trusted information for business and technical information consumers. A user who is supposed to trust information must understand its context and thus must know, for example, the source of the data or associated data quality characteristics.
- Cost-effective problem resolution in the information supply chain requires data lineage which is based on Metadata.
- Business requirements change over time, entailing changes in the IT environment. Impact analysis based on Metadata makes the consequences of such changes in the IT infrastructure understandable before they are applied. For example, for a change to a data transformation job that is part of a complex series of transformation jobs, impact analysis would show whether and how many subsequent jobs are affected (a minimal sketch follows below).
Data lineage and impact analysis are detailed in Chapter 10 in the context of a detailed use case scenario.
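To make the idea concrete, the following minimal sketch shows impact analysis as a simple traversal of a Metadata dependency graph. The job names and their dependencies are hypothetical illustrations, not part of any specific product.

```python
# Minimal sketch: impact analysis over a Metadata dependency graph.
# The job names and dependencies below are hypothetical.
from collections import deque

# Technical Metadata: which transformation job feeds which downstream jobs
DEPENDENCIES = {
    "extract_orders": ["cleanse_orders"],
    "cleanse_orders": ["load_order_mart", "load_dw_fact_sales"],
    "load_dw_fact_sales": ["refresh_sales_report"],
}

def impacted_jobs(changed_job):
    """Return every downstream job affected by a change to changed_job."""
    impacted, queue = set(), deque([changed_job])
    while queue:
        for successor in DEPENDENCIES.get(queue.popleft(), []):
            if successor not in impacted:
                impacted.add(successor)
                queue.append(successor)
    return impacted

print(impacted_jobs("cleanse_orders"))
# -> {'load_order_mart', 'load_dw_fact_sales', 'refresh_sales_report'}
```

Data lineage is the same traversal in the opposite direction: following the dependencies upstream from a report back to its sources.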
4.1.2 Master Data Management Capability
The MDM capability3 (also discussed in more detail in Chapter 11) provides the following functions to the business:
- It creates the authoritative source of Master Data within an enterprise laying the foundation to establish guidelines for the lifecycle management of Master Data.
- It makes Master Data actionable, delivering sustained business value through event mechanisms on Master Data changes; a minimal sketch follows below. (For example, three months before an insurance contract with a customer expires, an event rule notifies the customer care representative to proactively contact the customer.)
- It simplifies and optimizes key business processes such as new product introduction and cross- and up-selling by providing state-of-the-art business services for Master Data, standardizing the way Master Data is used across the enterprise.
- It reduces cost and complexity with a central place to enforce all data quality, business rules, and access rights on Master Data.
- A centralized MDM solution is a cornerstone for effective Information Governance on Master Data allowing centralized enforcement of governance policies.
- It improves reporting results in the analytical environment by providing consistent Master Data for the dimension tables in a DW.
MDM has a strong dependency on the EII capability during the Master Data Integration (MDI) phase. Using EII functions during the MDI phase, the Master Data is extracted from the current systems, cleansed, and harmonized before it is loaded into the MDM System (and possibly also into a DW system to improve data quality in the dimension tables).
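The following minimal sketch illustrates the event mechanism from the contract expiry example above. The record layout, the 90-day horizon, and the notification format are illustrative assumptions.

```python
# Minimal sketch of an event rule on Master Data: notify the customer care
# representative when a contract is within three months of expiry.
# Record layout and names are illustrative.
from datetime import date, timedelta

def expiry_events(contracts, today, horizon_days=90):
    """Yield a notification for each contract expiring within the horizon."""
    for c in contracts:
        if today <= c["expires"] <= today + timedelta(days=horizon_days):
            yield (f"Notify rep {c['rep']}: contract {c['id']} for customer "
                   f"{c['customer']} expires on {c['expires']}")

contracts = [{"id": "C-1001", "customer": "Smith", "rep": "jones",
              "expires": date(2010, 9, 1)}]
for note in expiry_events(contracts, today=date(2010, 7, 1)):
    print(note)
```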
4.1.3 Data Management Capability
The data management capability serves the following business needs:
- Efficient Create, Read, Update, and Delete (CRUD) functions for transactional systems processing structured Operational Data
- Appropriate enforcement of access rights to data to allow only authenticated and authorized users to work with the data
- Low administration costs through efficient administration interfaces and autonomics
- Business resiliency of the operational applications by providing proper continuous availability functions, including high availability and disaster recovery
In essence, the data management capability provides all functions needed by transactional systems such as order entry or billing applications to manage structured Operational Data across its lifecycle.
4.1.4 Enterprise Content Management Capability
The ECM capability addresses the following business requirements:
- Compliance with legal requirements (for example, e-mail archiving)
- Efficient management of Unstructured Data (for example, insurance contracts)
- Delivery of content for web applications (for example, images for e-commerce solutions)
- Appropriate enforcement of access rights to Unstructured Data to allow only authenticated and authorized users to work with the data
- Business resiliency of the operational applications by providing proper continuous availability functions, including high availability and disaster recovery
- Comprehensive content-centric workflow capabilities to enable, for example, workflow-driven management of insurance contracts (a minimal sketch follows below)
This capability enables end-to-end management of Unstructured Data needed in many industries such as the insurance industry. It also enables compliance solutions for e-mail archiving.
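As a concrete illustration of a content-centric workflow, the following minimal sketch models an insurance contract moving through a fixed set of workflow states. The states and transitions are illustrative assumptions, not a prescribed ECM workflow model.

```python
# Minimal sketch of a content-centric workflow for insurance contracts.
# States and allowed transitions are illustrative.
ALLOWED_TRANSITIONS = {
    "received": {"under_review"},
    "under_review": {"approved", "rejected"},
    "approved": {"archived"},
    "rejected": {"archived"},
}

class ContractWorkflow:
    def __init__(self, document_id):
        self.document_id = document_id
        self.state = "received"

    def advance(self, new_state):
        if new_state not in ALLOWED_TRANSITIONS.get(self.state, set()):
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.state = new_state

wf = ContractWorkflow("contract-4711")
wf.advance("under_review")
wf.advance("approved")
print(wf.document_id, wf.state)  # contract-4711 approved
```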
4.1.5 Analytical Applications Capability
The analytical applications capability addresses two areas: the DW and Identity Analytics capability, and the Predictive and Real Time Analytics capability.
4.1.5.1 Data Warehousing and Identity Analytics Capability
The capability area of DW and Identity Analytics applications (mostly looking into the past) delivers the following functions for a business:
- Identity Analytics can be applied to mitigate fraud or to improve homeland security by discovering non-obvious and hidden relationships.
- DWs are the foundation of reporting for business analysts; they report on historical data—the past.
- Integration of analytics covering Structured and Unstructured Data in a DW is one of the required steps towards Dynamic Warehousing (DYW, see Chapter 13). For example, analyzing blog posts about a product to find out which features customers like, or which parts they most often reported as broken, alongside sales statistics provides new insight. This insight is unavailable when reporting on Structured Data only.
- Discovery mining in a DW allows a business to discover patterns. An example would be association rule mining to find out which products are typically bought together (a minimal sketch follows below).
Building a DW with enterprise scope, where Operational Data from heterogeneous sources must be extracted, cleansed, and harmonized before it is loaded into the DW system, has a strong dependency on the EII capability.
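To illustrate the association rule mining example, the following minimal sketch counts which product pairs occur together most often in a set of transactions; this co-occurrence counting is the core of support computation in association rule mining. The sample baskets are illustrative.

```python
# Minimal sketch of discovery mining: count product pairs bought together.
# The sample transactions are illustrative.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "milk"},
]

pair_counts = Counter()
for basket in transactions:
    # Count each unordered product pair once per transaction
    pair_counts.update(combinations(sorted(basket), 2))

for pair, count in pair_counts.most_common(3):
    print(pair, count)
# ('bread', 'butter') 2 and ('bread', 'milk') 2 lead the list
```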
4.1.5.2 Predictive Analytics and Real Time Analytics Capability
Building Intelligent Utility Networks (IUN, see Chapter 13), improving medical treatment for prematurely born babies (see Chapter 14), and anticipating trends in customer buying decisions are use cases where companies in various industries are not satisfied with analytic functions that report only on the past. Here, two areas emerge for satisfying business needs by looking into the present or the future:
- Predictive Analytics are capabilities allowing the prediction of certain values and events in the future. For example, based on past electricity consumption patterns, an energy provider would like to predict future spikes in consumption to optimize energy generation and to reduce loss in the infrastructure (a minimal sketch follows after this list).
- Real Time Analytics are capabilities addressing the need to analyze large-scale volumes of data in real time. They consist of the ability to trickle-feed data into the DW in real time (see Chapters 8 and 13 for details on trickle feeds), the ability to execute complex reporting queries in real time within the DW, and the ability to deliver analytical insight to front-line applications in real time. Stream analytics are another Real Time Analytics capability, which is introduced in section 4.1.7 and detailed in Chapters 8 and 14.
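The following minimal sketch illustrates the Predictive Analytics example above: it fits a least-squares trend over past hourly electricity consumption and extrapolates one step ahead to flag a likely spike. The readings and the spike threshold are illustrative assumptions.

```python
# Minimal sketch of Predictive Analytics: extrapolate a consumption trend.
# Readings and threshold are illustrative.
def linear_trend(values):
    """Fit y = a + b*x by ordinary least squares over x = 0..n-1."""
    n = len(values)
    mean_x, mean_y = (n - 1) / 2, sum(values) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
         / sum((x - mean_x) ** 2 for x in range(n)))
    return mean_y - b * mean_x, b

consumption = [50, 52, 55, 60, 66, 73]      # past hourly load in MW
a, b = linear_trend(consumption)
forecast = a + b * len(consumption)         # extrapolate one hour ahead
print(f"forecast: {forecast:.1f} MW, spike expected: {forecast > 75}")
# forecast: 75.5 MW, spike expected: True
```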
4.1.6 Business Performance Management Capability
BPM is a capability enabling business users to:
- Define the Key Performance Indicators (KPI) for the business.
- Monitor and measure against the defined KPIs on an ongoing basis.
- Visualize the measurements in a smart way enabling rapid decision making.
- Complement the visualization with trust indices about the quality of the underlying data, putting the results in the context of their trustworthiness.
- Act intelligently if the measurement of the KPIs indicates a need to act.
- Trigger events and notifications to business users if there are abnormalities in the data.
This capability often depends on strong analytical application capabilities. A minimal sketch of KPI monitoring follows.
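The sketch below illustrates the monitoring and notification aspects: a KPI measurement is checked against its target, and the result is complemented with a trust index for the underlying data. The KPI name, target, and thresholds are illustrative assumptions.

```python
# Minimal sketch of BPM monitoring: measure a KPI against its target and
# put the result in the context of a trust index. Names are illustrative.
from dataclasses import dataclass

@dataclass
class KPI:
    name: str
    target: float
    trust_index: float  # 0.0 (untrusted data) .. 1.0 (fully trusted data)

def evaluate(kpi, measurement):
    status = "on track" if measurement >= kpi.target else "ACT: below target"
    note = "" if kpi.trust_index >= 0.8 else " (low-trust data, verify source)"
    return f"{kpi.name}: {measurement} vs. {kpi.target}: {status}{note}"

print(evaluate(KPI("on-time delivery rate", target=0.95, trust_index=0.7),
               measurement=0.91))
# on-time delivery rate: 0.91 vs. 0.95: ACT: below target (low-trust data, ...)
```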
4.1.7 Enterprise Information Integration Capability
From a business perspective, comprehensive EII (Chapter 8 provides more details on this topic) provides abilities to understand, cleanse, transform, and deliver data throughout its lifecycle. Examples include:
- Data harmonization from various Operational Data sources into an enterprise-wide DW.
- For cost and flexibility reasons, new applications should not be tied to specific versions of the various, heterogeneous data sources they access; the complexity of these sources must be hidden. Thus, federated access must be available.
- Re-use of certain data cleansing functions, such as standardization services in an SOA, to achieve data consistency and improve data quality on data entry must be supported. This requires the ability to deploy data quality functions as services.
The term Extract-Transform-Load (ETL) typically identifies EII capabilities to extract data from source systems, to transform it from the source to the target data model, and finally to load it into the target system. ETL is thus most often a batch-mode operation. Typical characteristics are that the data volumes involved are generally large, the process and load cycles are long, and complex aggregations and transformations are required.
During the last two to three years, ETL in many enterprises changed from custom-built environments with little or no documentation to a more integrated approach using suitable ETL platforms. Improved productivity was the result of object and transformation re-use, strict methodology, and better Metadata support—all functions provided by the new ETL platforms.
A discipline known as Enterprise Application Integration (EAI)4 is typically considered for solving application integration problems in homogeneous as well as heterogeneous application environments. Historically, applications were integrated with point-to-point interfaces between the applications—this approach of tightly coupled applications failed to scale with the growing number of applications in an enterprise because the maintenance costs were simply too high. Every time an application was changed, the point-to-point connections themselves, as well as the applications on the other end of the connections, had to be changed too. Avoiding these costs and increasing flexibility for evolving applications were drivers of the wide-spread deployment of an SOA. As a result, applications became more loosely coupled, creating more agility and flexibility for business process orchestration. Now, in many cases, applications are integrated with an Enterprise Service Bus (ESB)5 based on message-oriented middleware. With that, an interface change in one application can be hidden from other applications because the ESB (through mediation using a new interface map) can hide the change. Compared to point-to-point connections, this is a significant advantage. Over the years, the discipline of EAI created many architecture patterns, such as Publish/Subscribe. IT Architects now have an abundance of materials6 available for this domain.
The use of ESB components opened new possibilities from an EIA perspective. High latency is the major disadvantage of traditional ETL moving data from one application to the next. Streams techniques and certain EAI techniques based on ESB infrastructure can be used to solve the problem of high latency for data movement. For example, if a customer places an order through the website of an e-commerce platform and expects product delivery in 24 hours or less, a weekly batch integration to make fulfillment and billing applications aware of the new order is inappropriate. Today, EAI solves this by providing asynchronous and synchronous near real time and real time capabilities useful for data synchronization across systems. EAI can effectively move data among systems in real time, but it neither defines an aggregated view of the data objects or business entities nor deals with complex aggregation problems; it generally handles transformations of data only at the message level. Thus, application integration techniques are most often found connecting Online Transactional Processing (OLTP) systems with each other.
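The following minimal sketch shows the mediation idea: publisher and subscribers are decoupled by a bus, and an interface map translates a renamed field so subscribers are shielded from the producer's interface change. The topic and field names are illustrative, and a real ESB would of course add queuing, routing, and transport on top.

```python
# Minimal sketch of ESB-style mediation with an interface map.
# Topic and field names are illustrative.
class MiniBus:
    def __init__(self):
        self.subscribers = {}     # topic -> list of handler callables
        self.interface_maps = {}  # topic -> {new field name: old field name}

    def subscribe(self, topic, handler):
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        mapping = self.interface_maps.get(topic, {})
        mediated = {mapping.get(k, k): v for k, v in message.items()}
        for handler in self.subscribers.get(topic, []):
            handler(mediated)

bus = MiniBus()
bus.subscribe("orders", lambda msg: print("fulfillment received", msg))
# The producer renamed 'cust_no' to 'customer_number'; the map hides this:
bus.interface_maps["orders"] = {"customer_number": "cust_no"}
bus.publish("orders", {"customer_number": "4711", "item": "A-100"})
# fulfillment received {'cust_no': '4711', 'item': 'A-100'}
```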
To date, the term EII has typically been used to summarize data placement capabilities based on data replication techniques and capabilities to provide access to data across various data sources. Providing a unified view of data from disparate systems comes with a unique set of requirements and constraints. First, the data should be accessible in a real-time fashion, which means that we should be accessing current data on the source systems as opposed to accessing stale data from a previously captured snapshot. Second, the semantics, or meaning, of the data needs to be resolved across systems. Different systems might represent the data with different labels and formats that are relevant to their respective uses, but these require some sort of correlation to be useful to the end user. Duplicate entries should be removed, validity checked, labels matched, and values reformatted. The challenges of this information integration technique involve governing the use of a collection of systems in real time and creating a semantic layer that maps all data entities into a coherent view of the enterprise data.
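The sketch below illustrates the semantic layer idea on a tiny scale: two source systems carry the same customer under different labels and formats; mapping both to a canonical schema and removing duplicates yields the unified view. The systems, field names, and normalization rules are illustrative assumptions.

```python
# Minimal sketch of a semantic layer: map labels, align formats, remove
# duplicates. Source systems and field mappings are illustrative.
crm_rows = [{"CustName": "Ann Smith", "Tel": "+1-555-0100"}]
erp_rows = [{"name": "ann smith", "phone": "15550100"}]

def normalize_phone(raw):
    return "".join(ch for ch in raw if ch.isdigit())

def to_canonical(row, name_key, phone_key):
    return {"name": row[name_key].title(),
            "phone": normalize_phone(row[phone_key])}

unified, seen = [], set()
for row in ([to_canonical(r, "CustName", "Tel") for r in crm_rows] +
            [to_canonical(r, "name", "phone") for r in erp_rows]):
    key = (row["name"], row["phone"])
    if key not in seen:            # duplicate entries are removed
        seen.add(key)
        unified.append(row)

print(unified)  # Ann Smith appears only once in the unified view
```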
With this overview of the traditional use of the terms ETL, EAI, and EII, we now propose the following new definition of EII:
EII consists of a set of new capabilities including Discover, Profile, Cleanse, Transform, Replicate, Federate, Stream, and Deploy capabilities. These techniques for information integration are applied across all five data domains: Metadata, Master Data, Operational Data, Unstructured Data, and Analytical Data. EII in this new definition includes the former notions of ETL and EII. It also covers the intersection of EII with EAI.
We briefly introduce the new set of EII capabilities which are described in more detail in Chapter 8:
- Discover capabilities—They detect logical and physical data models as well as other Technical and Business Metadata. They enable understanding of the data structures and business meaning.
- Profile capabilities—They consist of techniques such as column analysis, cross-table analysis, and semantic profiling. They are applied to derive the rules necessary for data cleansing and consolidation because they unearth data quality issues in the data, such as duplicate values in a column supposedly containing only unique values, missing values for fields, or non-standardized address information (a minimal sketch follows after this list).
- Cleanse capabilities—They improve data quality. Name7 and address standardization, data validation (for example, address validation against postal address dictionaries), matching to identify duplicate records (enabling reconciliation through survivorship rules), and other data cleansing logic are often used.
- Transform capabilities—They are applied to harmonize data. A typical example is the data movement from several operational data sources into an enterprise DW. In this scenario, transformation requires two steps: first, the structural transformation of a source data model to a target data model; second, a semantic transformation mapping code values in the source system to appropriate code values in the target system.
- Replicate capabilities—They deliver data to consumers. Typical data replication technologies are database-focused, using either trigger-based or transactional log-based Change Data Capture (CDC) mechanisms to identify the deltas requiring replication.
- Federate capabilities—They provide transparent and thus virtualized access to heterogeneous data sources. From an information-centric view, federation is the topmost layer of virtualization techniques. Federation improves flexibility by not tying an application to a specific database or content management system vendor. Another benefit is avoiding the cost of consolidating data into a single system by leaving it in place. The benefits of federation apply to Structured Data (also known as data federation) and Unstructured Data (also known as content federation).
- Stream capabilities—They are a completely new set of capabilities that an EIA must cover. Enterprises need them because Structured and Unstructured Data volumes have reached levels where the sheer amount of data can no longer be persisted. Consider, for example, the total amount of messages exchanged over a stock trading system with automated brokering agents run by large financial institutions. In such an environment, streaming capabilities consist of a low-latency data streaming infrastructure and a framework to deploy Real Time Analytics onto the data stream, generating valuable business insight and identifying the small fraction of the data that requires an action or is worthwhile to persist for further processing.
- Deploy capabilities—They provide the ability to deploy EII capabilities as consumable information services. For example, a federated query can be exposed as an information service. In this case, the federated query might be invoked as a real time information service using specific protocols specified at deploy time.
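The following minimal sketch illustrates the Profile capability mentioned above with a simple column analysis: it checks a supposedly unique column for duplicates and counts missing values. The sample records are illustrative.

```python
# Minimal sketch of column analysis: uniqueness and missing-value checks.
# The sample records are illustrative.
from collections import Counter

records = [
    {"customer_id": "C1", "zip": "10001"},
    {"customer_id": "C2", "zip": None},
    {"customer_id": "C1", "zip": "94103"},  # violates expected uniqueness
]

def profile_column(rows, column, expect_unique=False):
    values = [r.get(column) for r in rows]
    report = {"column": column, "missing": sum(v is None for v in values)}
    if expect_unique:
        counts = Counter(v for v in values if v is not None)
        report["duplicates"] = [v for v, n in counts.items() if n > 1]
    return report

print(profile_column(records, "customer_id", expect_unique=True))
print(profile_column(records, "zip"))
# {'column': 'customer_id', 'missing': 0, 'duplicates': ['C1']}
# {'column': 'zip', 'missing': 1}
```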
4.1.8 Mashup Capability
Mashup capabilities (see Chapter 12 for more details on Mashups) enable a business to quickly build web-based, situational applications at low cost for typically small user groups (for example, all members of a department). The Mashup capability must allow non-technical users to create new value and insight by mashing together information from various sources.
4.1.9 Information Governance Capability
As outlined in Chapter 3, the Information Governance capability is a crucial part of the design, deployment, and control processes of any instantiation of the EIA throughout its lifecycle. The Information Governance capability enables a business to manage and govern its information as strategic assets. More specifically, it:
- Aligns people, processes, and technology to implement an enterprise-wide strategy with policies governing the creation and use of information.
- Assigns Information Stewards to govern information assets throughout their lifecycle. The Information Stewards govern each information asset in the scope of defined policies, whose enforcement might be automated or might require a human being to execute a task.
- Assigns a balance sheet describing the value of the information and the impact of loss or improper management from a data quality perspective.
4.1.10 Information Security and Information Privacy Capability
The Information Security and Information Privacy capability is relevant for any enterprise for two major reasons:
- Information Security functions protect information assets from unauthorized access, which reduces the probability of losing mission-critical information.
- Information Privacy functions enable a company to comply for example with legal regulations protecting the privacy of Personally Identifiable Information (PII).
Thus, this capability spans all other capabilities previously introduced in this section. For example, the Information Governance capability would define the security policies for information, whereas the Data Management capability would need to deliver the required security features either itself or through integration with external systems delivering them. In the Component Model presented in Chapter 5, this capability is decomposed into a number of different components to address all the requirements with functions coherently grouped by component. Foreshadowing them, you can anticipate the following components:
- A component providing comprehensive authentication and authorization services.
- A component (known as a De-Militarized Zone [DMZ]) protecting backend systems from external and internal sub-networks using a Reverse Proxy Pattern.
- Base security services such as encryption and data masking services delivered as a security sub-component through the IT Service & Compliance Management Service Component (a minimal masking sketch follows this list).
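As an illustration of such a base security service, the following minimal sketch masks PII fields before a record leaves a protected zone. The field list and the hashing-based masking scheme are illustrative assumptions; production data masking typically offers format-preserving and reversible options as well.

```python
# Minimal sketch of a data masking service: replace PII values with stable,
# non-reversible tokens. Field names and scheme are illustrative.
import hashlib

PII_FIELDS = {"ssn", "email"}

def mask_record(record):
    masked = {}
    for field, value in record.items():
        if field in PII_FIELDS and value is not None:
            token = hashlib.sha256(str(value).encode()).hexdigest()[:10]
            masked[field] = f"MASKED-{token}"
        else:
            masked[field] = value
    return masked

print(mask_record({"name": "Ann Smith", "ssn": "123-45-6789",
                   "email": "ann@example.com"}))
# name stays readable; ssn and email become MASKED-... tokens
```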
4.1.11 Cloud Computing Capability
Due to its business value promise, the Cloud Computing capability is necessary for many enterprises today and represents a new delivery model for IT. However, the Cloud Computing delivery model is more than just a new way of billing for IT resources: a new set of IT capabilities has been developed, and significant changes to existing IT components have been applied. Not surprisingly, the Cloud Computing delivery model also affects the Information Management domain and therefore the EIA within an enterprise. Thus, we briefly introduce functional and technical capabilities relevant for the Cloud Computing delivery model. (See more details on the implications of Cloud Computing in Chapter 7, section 7.4.)
- Multi-Tenancy capabilities—They define the sharing of resources as part of the multi-tenancy concept. Multi-tenancy can be applied at distinct layers, such as the application layer, the information layer (for example, shared databases where each tenant has its own schema for maintenance operations), and the infrastructure layer (for example, multiple Operating System [OS] instances on the same hardware).
- Self-Service capabilities—They define services to allow the tenant to subscribe to IT services delivered through Cloud Computing with a self-service User Interface (UI). For example, you can subscribe through a web-based self-service UI to collaboration services on LotusLive8 or use virtualized IT resources in the Amazon Elastic Compute Cloud9 on a pay-per-use model. While subscribing, you select the parameters for the Service-Level Agreements (SLA) as needed.
- Full Automation capabilities—They define services to allow cost reduction on the IT service provider side. Thus, after a tenant subscribes to a service offered, the deployment and management throughout the lifecycle of the service must be fully automated from the point the service is provisioned to the point the service is decommissioned.
- Virtualization capabilities—They provide for the virtualization of resources to enable multi-tenancy. From an information perspective, key capabilities are storage hardware virtualization as well as a completely virtualized IO layer. For example, a virtualized IO layer10 has characteristics such as sharing of storage for multiple consumers with seamless expansion and reduction, high availability and reliability (for example, to allow capacity expansion or maintenance without downtime), policy-driven automation and management, and high performance.
- Elastic Capacity capabilities—They provide services to comply with SLAs even when peak workloads occur. This means the computing capacity needs to be “elastic” in the application, information, and infrastructure layers, growing and shrinking as demanded. The assignment, removal, and distribution of resources among tenants have to be autonomic and dynamic to deliver cost-efficient workload balancing with optimized use of the available resources.
- Metering capabilities—They provide instrumentation and supporting services to know how many resources a certain tenant has consumed. For example, the costs for a service provider offering a database cloud service are affected by the amount of data managed for a tenant who subscribed to this service: more data requires more storage, and higher storage consumption by the service consumer increases the cost for the service provider. For such a cloud service, the ability to meter storage consumption for a database would be a key metering function (a minimal sketch follows after this list).
- Pricing capabilities—They compute the costs for the tenant of the subscribed services based on the resource consumption measured by the metering capabilities. They also allow the definition and adjustment of prices over time to reflect an increase or a decrease of the costs incurred by the cloud service provider. Pricing services are invoked by the billing services for bill creation.
- Billing capabilities—They are used for accounting purposes. They create and send out the bill to the tenant. Monitoring might affect the billing and pricing services. For example, if the SLAs promised to the cloud service consumer haven’t been met and the monitoring component has measured this fact, the amount of money charged to the service consumer might be reduced as part of the contract.
- Monitoring capabilities—They provide the end-to-end monitoring that managing a cloud requires. Monitoring capabilities are an important element of measuring adherence to performance and other requirements of resources and applications. In cloud environments, the task of monitoring becomes more critical due to the highly virtualized environment.
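To illustrate the metering example from the list above, the following minimal sketch records storage consumption per tenant and computes the input a pricing service would use. Tenant names, sizes, and the per-megabyte price are illustrative assumptions.

```python
# Minimal sketch of metering storage per tenant in a database cloud service.
# Tenants, sizes, and price are illustrative.
from collections import defaultdict

class StorageMeter:
    def __init__(self):
        self.usage_mb = defaultdict(float)  # tenant -> megabytes in use

    def record(self, tenant, delta_mb):
        """Meter a storage increase (or decrease) for one tenant."""
        self.usage_mb[tenant] += delta_mb

    def billable_amount(self, tenant, price_per_mb):
        """What the pricing service would compute from the metered usage."""
        return self.usage_mb[tenant] * price_per_mb

meter = StorageMeter()
meter.record("tenant-a", 500.0)   # tenant A loads 500 MB
meter.record("tenant-b", 120.0)
meter.record("tenant-a", -50.0)   # tenant A deletes 50 MB
print(meter.billable_amount("tenant-a", price_per_mb=0.01))  # 4.5
```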