Problems with Data
For a long time, computers were used strictly as calculating machines. Data was fed into them manually from punch cards, a holdover from tabulating machinery. The computer crunched the numbers and generated, with limited flexibility, reams of reports on the familiar "greenbar." Data in computers was typically lost because there was nowhere to store it. Storage wasn't considered important anyway, as punch cards and paper reports, usually retyped into forms with a typewriter, were considered the official system of record. The computer and its programs and data were merely tools to speed up some work.
Both greenbar and official reports were stored for recordkeeping. This is how it had always been done, and the appearance of a "big calculator" didn't change anything right away. In fact, many firms, well into the 1990s, kept paper records long after the appearance of magnetic tape and later nonsequential storage devices, such as magnetic disk drives. Perhaps the tradition of keeping important records, such as legal documents, on paper contributed to this habit. It wasn't until total quality management (TQM) and business process reengineering (BPR) took hold that the inefficiencies of large organizations were exposed. Unfortunately, the attitude that computing was just a calculator and not to be trusted for real work would take much longer to break down than it took to get started.
The Reporting Gap
With the introduction of permanent storage devices, however, computers began to take on a role as archival devices as well as number crunchers and report generators. Although paper was king at the time, the economy of microfiche was easily understood and accepted because it didn't alter the concept of computers outputting information in a format understood by businesspeople. Microfiche just compressed information and made it more durable. However, magnetic storage devices, especially tape, ushered in a completely new way of thinking about computers in business. Tapes stored everything—data, report streams, and programs. The tapes even substituted for core memory, allowing programmers the luxury of writing programs that used more resources than were actually available. Computers gradually became bona fide business tools.
Computers gained speed and flexibility through the advancements of disk drives, solid-state memory, and virtual memory operating systems. Eventually, interfaces or "transaction monitors," such as IBM's CICS, opened a new discipline: online systems. Instead of batch machines that ran serial scheduled jobs, transaction monitors allowed programmers to create applications for real-time, instantaneous processing of multiple processes (although most small "transactions" were stored for later batch processing). Soon, transaction processing was born, the polar opposite of batch computing. Online transaction processing (OLTP) emerged as the next big thing and never really faded. The bulk of corporate computing, even as it moves through generations of technology, is still transaction-oriented.
The advent of formal data methodologies and software (Chen, Codd, Computer- Aided Software Engineering [CASE] tools, and relational databases) also led to better appreciation and use of the information stored in computing technology. So we moved from computers for processing to computers for information. One thing was missing from these transactions, however—decision making. All the "brainpower" in transactions came from people. The computer's role was to capture and store the data needed to record the transaction. If a judgment had to be made, a person made it. If the transaction had to be approved, the system waited for someone to approve it. These transactions were increasingly fast and essential to business operations, but they remained unintelligent.
Additionally, the more transaction-oriented corporate computing became, the more alienated it became from its original purpose: reporting. As a result, a new industry emerged in the late 1970s that was designed to address this gap. At the time, this industry was called decision support systems (DSSs) or sometimes fourth-generation languages (4GLs), but its goal was to enable users, mostly businesspeople, to create models and generate reports without the IT department's involvement. This industry later became known as business intelligence, but almost all the early providers of these products fell by the wayside before then.
Mainstream systems evolved from batch to OLTP to client/server and finally, by 2000, to Y2K/enterprise applications, as shown in Figure 3.1. Business intelligence and analytics developed separately, supported by functional areas, particularly finance and marketing. As computing in general became more advanced and complicated, it ended up back in IT. It was called "data warehousing" and "business intelligence," but it was still a separate discipline. The original motivation for this separation was mostly IT's neglect of reporting and analysis, but as the need for and volume of business intelligence grew, IT took control to protect its existing resources. The gap has continued to grow and is, with today's technology, largely artificial and unnecessary.
Figure 3.1 The divergence of applications and analytics as computing evolved
Many of the early DSS tools were quite ingenious, with features that current analytical tools would envy, such as nonprocedural modeling, multidimensional orientation, goal seeking, statistical and stochastic routines, graphics, and tools for development and collaboration. Their weakness was data. Without the data in transactional systems, these tools were limited to whatever data businesspeople could generate on their own, such as budget, planning, and forecasting data. In many cases, getting existing data required reentering data from printed reports, a time-consuming and tedious process. Nevertheless, these tools flourished, especially in planning and budgeting roles where the need for external data was low.
The IT department wasn't supportive of these tools. There was a constant refrain that they would "bring the mainframe to its knees," but in practice that rarely happened. What did happen was that for the first time, "power users" emerged—non-IT people who could actually develop applications and generate reports from computers, a privileged group that endures to this day. In fact, there are so many of them, and their output is so important, that this group has a name of its own: shadow IT.2
Early DSS vendors met their Waterloo in the early 1980s when a small product was launched: Lotus 1-2-3. It turned out that a PC with a Lotus 1-2-3 spreadsheet was capable of doing much, if not all, of the functions of mainframe- or mini-based DSS tools. Connectivity was actually worse, because PCs weren't connected to the mainframe, but the freedom and novelty was so great that, within a few years, most DSS vendors disappeared.
A bigger drain on IT was the constant flood of requests for data extracted from operational systems, both decision support systems and PCs. Many alternatives were tried:
- Formal programming requests with highly specific requirements
- Generic abstracts to flat files that could be shared by more than one request
- Nonprocedural programming languages, such as A Programming Language (APL), that allowed a less programmer-centered approach
- Ad hoc query tools, such as Query Management Facility (QMF), so that users could create their own abstracts
In the end, none of these approaches worked well because, for the most part, they couldn't be maintained. The number of extracts expanded geometrically, as shown in simplified form in Figure 3.2, and changes in programs, requirements, and tools overwhelmed IT's ability to respond to requests. There were no standards or best practices for non-IT people to develop libraries of extraction and importation routines, so the "self-service" aspect suffered, too. The sheer number of extracts was also a burden on often heavily used mainframe and minicomputer hosts.
Figure 3.2 Simplified representation of extracts from various systems for diverse reporting needs
Enter the Data Warehouse
Originally, the data warehouse (or information warehouse, as IBM called it) was designed to solve a problem for IT departments, not end users. The goal was to build a database to provide in-place access to physically distributed databases and serve as a repository for satisfying report requests from end users (because individual silo databases couldn't do so), not to support self-service or direct access. This key point explains why even current data warehouse approaches often use a layered, after-the-fact batch architecture with a substantial investment in an enterprise data warehouse that knowledge workers never, or rarely, query directly.
The term "data warehouse" has an uncertain provenance. The most influential person who shaped the concept's development is probably Bill Inmon. For more than 20 years, he has dominated as a thought leader, with dozens of books and countless appearances and publications. However, the practice of data warehousing has surged ahead on many fronts, with broad adoption of Inmon's original work, Ralph Kimball's architecture and methodology, and the contributions of many others. Data warehousing has become a well-established and vibrant strand in the IT mainstream.
However, in the early to mid-1980s, a typical organization owned an IBM mainframe that struggled to perform all scheduled jobs. The computer often simply ran out of computing cycles to do all the work, necessitating an upgrade to a newer, more powerful machine. This solution was expensive and temporary, because new requests were already lined up to absorb all the new computing cycles. This management-from-scarcity approach resulted in a sort of triage process in which requests for new applications were evaluated. Most were put into a queue, the so-called backlog, where they languished sometimes indefinitely. This backlog was largely made up of business departments' requests for reports, because preference was given to operational programs.
The original concept of data warehousing was quite simple: Buy another mainframe, install a more "friendly" operating system (such as IBM's VM/CMS), and provide a set of utilities to build a relational database of data culled from various operational systems. This solution would relieve the other mainframe's load of numerous and overlapping extracts and allow IT to service users' reporting needs with tools of its choosing. How reporting needs were met took different forms. At the simplest level, the new machine simply took copies of data from various systems, in whole or in part. From these files or databases, programmers could draw their own extracts for reporting programs without affecting the "main" mainframe's performance.
This method is very different from the current concept of data warehousing. For one thing, there were no business users in the data warehouse; it was designed for IT. Business units, with or without support from IT, developed their own reporting and analysis solutions to compensate for the weak reporting operational systems offered. None of these solutions, however, was completely effective, and all were expensive and tedious to build and maintain. Eventually, power users of DSS tools and PCs got access to the early data warehouse, and it became apparent that the current arrangement wasn't satisfactory. Combining information from multiple extracts, the data simply did not line up. Whether the problem was caused by coding errors, sloppy data entry, or timing errors, merging information across the various disconnected source systems or "stovepipes" was impossible. Solving this problem was the beginning of data warehousing as it is known today: the integration of disparate data into a common logical model. In time, data warehousing became a structured solution to address these intractable reporting and analysis problems:
- Performance—Systems designed for transaction processing performance, especially in an era of expensive, proprietary hardware, were unable to tolerate the added load of query and report processing.
- Data quality—OLTP system data integrity was limited to the application's needs, so integrating data across systems was difficult to impossible.
- Access—Initially, connectivity issues were paramount, but security, licensing, and organizational boundary disputes also restricted access to information.
- Stewardship of historical data—OLTP systems didn't maintain the historical analysis needed for variance reporting, trend analysis, and even statutory and regulatory requirements.
In the 20 or so years since this concept emerged, the data warehouse has taken on far more responsibility, some of which it still hasn't met completely. For instance, many see the "enterprise" data warehouse as the "single version of the truth," a comprehensive repository of the one true set of data and semantics that everyone can get behind. Having absorbed the role of business intelligence, it's also the architecture for all forms of end-user reporting, query, and analysis. It's the plug that fills the gap between operational and analytical processing, a gap that has been steadily widening since the advent of online processing. As Figure 3.3 shows, a data warehouse is the hub for spokes that lead to operational systems, analysts, customers, performance management, and more.
Figure 3.3 A modern data warehouse plays a more central role in information systems (teradata)
A functioning data warehouse is a collection of data, models, metadata, processes, practices, standards, and services. At the most fundamental level, a data warehouse gathers data from other systems and transforms it into a merged model of sorts. This process can range from preserving almost all the original data at the most detailed level to compressing and summarizing at a much higher level of aggregation or, in most cases, both. What's most important is finding a way to preserve as much of the data's original intent as possible yet fit it into a single schema, at least initially. This process often involves "cleaning" the data because data in various source systems might be of poor quality, especially historical data. Another unique characteristic of data warehouses is the extent of historical data they retain. This data provides the basis for tasks such as trend analysis, statutory and regulatory reporting, and variance reporting.
Keeping historical data presents a new set of challenges, however. In today's fast-changing world, shoehorning three-year-old data into current data models can be exceedingly difficult. Data warehousing has developed some ingenious methods for pulling off this process, but keep in mind that these innovations were added to the original concept.
Data warehouses have also prolonged and extended the gap between operational and analytical processes. Because data warehouses and the business intelligence tools used with them were separate disciplines from other enterprise applications, even the small amount of reporting and analysis that occurred in operational systems gradually migrated to data warehouse/business intelligence systems.
The data warehouse was a simple idea, but in practice, its implementation caused, and still is causing, problems. Extracting data from source systems and transforming it into an integrated whole is a monumental effort. The latency caused by the load and refresh cycle devalues data. The scale of data warehouses grew exponentially, leading to spiraling costs and performance problems. The industry divided over polarizing issues of database design, architecture, and methodology. Most of all, the data warehouse's purpose changed dramatically from a tool for programmers to an analytical environment for knowledge workers. Enter business intelligence.
Business Intelligence
The terms "data warehousing" and "business intelligence" are like binary star systems: They orbit each other, but from a distance, they appear to be a single entity. Because the subjects of data warehousing and business intelligence are somewhat aligned with EDM, understanding the differences is important, as is knowing how each evolved, what problems they were designed to solve, and how well they performed.
In its purest form, business intelligence is supposed to be part of decision making in an organization. Phrases such as "better data for improved decision making," "getting the right information to the right person at the right time," and "the single version of the truth" are common. The focus is on informing people with data, who then go about making decisions. Business intelligence is designed to inform people and provide a way for them for evaluate information in retrospect. A branch of business intelligence now called performance management also has provisions for forecasting and planning, but in practice, most of performance management is still retrospective analysis and reporting.
Some people refer to the process of people getting and using information as business intelligence, but the term was coined when many providers, concepts, and methods for today's business intelligence didn't exist. Some fuzziness in the definition of business intelligence is unavoidable. In fact, the term "business intelligence" has more or less subsumed the practice of data warehousing and has become synonymous with a range of functionality that business intelligence vendors offer.
Of all the types of business intelligence, reporting, ad hoc query, online analytical processing (OLAP), dashboards, scorecards, visualization, and even data mining, one common thread is delivering information to people. With transaction systems still relying on human intelligence for decision making and judgment, business intelligence has sold itself to a new group of users by promising to improve decisions. If people must make a decision because the system can't, at least they should make one informed by all the facts or data in the data warehouse. Even business activity monitoring (BAM), which catches data on the fly and applies some simple rules to process data or let it go, ultimately alerts people through visual devices. Some BAM products are capable of bypassing people in the loop and alerting applications, services, or processes directly, but this feature is unusual. Most implementations update digital dashboards for people to view and navigate.
Operational Business Intelligence
As the gap between operational and analytical processes closes, thinking about the role of business intelligence in this new landscape is natural. Latency in business intelligence is an issue because data warehouses are updated monthly, weekly, and often daily, but rarely within a business day. Operational reporting or monitoring that requires fresher data is a problem. The concept of a data warehouse is based on layers of schema that are updated in batch offline and then conformed, indexed, preaggregated, and otherwise tuned and dispatched. Trickling in new data in real time disrupts this model.
Another problem is that after data is culled from its source and integrated into the data warehouse, the connection to its source can be lost. Although vendors have invested heavily in maintaining these links, persistent connectivity problems make data warehouses a poor candidate for single-source reporting. The solution was a sort of kludge called the operational data store. It violated many data warehouse principles, even though it was still considered part of the data warehouse. In time, operational data stores were also integrated, so they lost their capability to tie back to the original systems. Because of the transformation that happens with integration, they inherited much of the same latency as data warehouses.
The need for more timely analysis of data is strong and growing, however. The business intelligence industry is working toward solutions of operational business intelligence, but that definition is typically limited to daily or "intraday" data.3 So for the time being, operational business intelligence is characterized as having the same output (to people) and using the same data model (the data warehouse) but taking place a little more often than daily. True real-time data monitoring has been scaled back in business intelligence to a watery concept called "right time," which means as fast as the data warehouse can handle it. Some vendors can trickle data into a data warehouse, but the answer to operational business intelligence clearly lies in the ability to get closer to business processes as composite applications or from message queues. In any case, the lingering shadow of the data warehouse plus people-based delivery of business intelligence will keep business intelligence from participating in making your systems smarter, but it can be a valuable contributor.
Data Mining, Predictive Reporting, and Operations Research
To improve the value of reporting and analysis tools for knowledge workers, some organizations have adopted data-mining technologies. Data mining involves using powerful mathematical techniques to analyze huge volumes of data, as from a data warehouse, and extract meaningful insight. This insight might mean finding out how two products are related in their sales patterns or what customer characteristics make them susceptible to an offer. Association rules, likelihood scores, and trending are possible outcomes of these insights.
Although much data mining is aimed at understanding historical data, some is focused on predictive analysis. What customer actions predict whether they will cancel a contract? What kinds of customers are likely to respond well to a store redesign? Which suppliers will have credit issues in the future? This kind of predictive reporting can increase the value of analysis by giving knowledge workers a sense of data's future implications. Generally, it doesn't prescribe actions or lend itself to causing change in systems. Most organizations have applied it only to making reporting and analysis more sophisticated. Predictive reporting improves the value of your data but doesn't make your systems smarter.
Operations research grew out of work in the United Kingdom during World War II and has been applied to business problems more over the years. Using a mix of statistical and mathematical techniques to find solutions to problems, operations research is more focused on finding the right action to take, given current constraints and historical data. Using mathematical techniques and optimization and simulation tools, it allows a modeler to set up a problem and then "solve" it to find the best outcomes, given existing constraints. Although some models can be embedded into systems to make them smarter, the range of solutions for which it's sufficient is limited. Therefore, operations research has focused on making people understand possible solutions instead. Many problems have been targeted and solved with operations research, but it hasn't generally been used in programs that run organizations day to day.
Although data mining and operations research have roles to play in solving these problems, problems with data aren't the only ones organizations have. You need to consider problems with programs as well, as discussed in the following section.