- Views of Data
- A Brief History of Data Architecture
- Advanced Data Management: Meta-data
- Graphics: Data Modeling
- Using Entity/Relationship and Object Models
- Normalization
- Data Modeling Conventions
- Entity/Relationship Model Validation
- The Requirements Analysis Deliverable: Column One
- Data and the Other Columns
- Conclusion
A Brief History of Data Architecture
While this multifaceted view of data is not universally practiced, it is generally accepted as a reasonable way to look at the world. How we got here turns out to be an interesting story.
The “Application Approach” to Systems . . .
Originally, we wrote programs to address particular problems. From the beginning of the computer industry in the 1940s, and through the early 1970s, data processing professionals focused on program processing. From machine language, through assembler, Fortran, and COBOL, programmers were concerned primarily with what a program was supposed to do. Presumably that was in support of some activity in an enterprise. Any data referred to were defined as needed by the program and then promptly forgotten. For example, we might write a program for a client that would take Whiz-bang input data and produce a Framis report of some kind. (See Figure 3.3.)
Figure 3.3. One Program.
As it happens, the fellow sitting next to our client sees the report and likes it. He'd like one, too. But could you add this little dollop of Knart data, and maybe summarize things a little differently? This gives us the configuration in Figure 3.4.
Figure 3.4. Two Programs.
Now, our popularity grows. Lots of people want variations on the report, using not only the data originally provided but various other kinds of data as well. This gives us Figure 3.5.
Figure 3.5. Until . . . .
In fact it usually gives us a picture even messier than is shown in Figure 3.5. There could be hundreds of these programs, all copying data from one place to another—often requiring re-entry of the same data—with no particular organization or cohesion.
The problem is compounded by the fact that the business is dynamic. Our clients are always changing the way they want things to work. By the time the configuration has matured to this complexity, the situation is all but hopeless. Errors cannot be fixed and enhancements cannot be added without generating yet more errors.
What Went Wrong?
It is easy to bemoan the mess we have found ourselves in. It is easy to imagine that things simply got out of control by virtue of general incompetence. The fact of the matter, however, is that this approach was based on three very specific mistaken assumptions.
1 Input's Connections to Output
First, we assumed that input and output were closely related. (See Figure 3.6.) Each project was a matched pair of inputs and outputs. Take some data, process it, and generate a report.
Figure 3.6. The Assumptions.
The problem with that assumption is that it's false.
The dynamics of input are in fact radically different from those of output. Input data tend to come from the operations of the company. The mechanisms for collecting them are difficult to set up, but once in place they are relatively stable and often are difficult to change. Output requirements, on the other hand, are usually highly dynamic. “I want a report that looks like this. I have never asked for this kind of report before, and I never will again, but this is what I want today.”
When somebody wants a new report and is told that it can't be produced because the structures for collecting the data are too inflexible, you have an unhappy client.
2 Overlapping Applications
A second, related, assumption was that each of the applications was a separate entity. By assuming that applications could be addressed in isolation, we wound up duplicating:
- Data definition
- Data entry
- Program code
. . . with the effect that we couldn't:
- Guarantee consistency
- Find errors easily
The fact of the matter is that much of the data that any one department requires is often required by other departments as well. For each and every department to pursue collection and massaging of the same data is not very productive.
3 Encoding Business in the Programs
A third assumption we made was that programs should represent the workings of the enterprise. This meant:
- The structure of the company's business was encoded in the programs.
- Changes in the business meant changes to programs (by programmers).
- “Ease of maintenance” (for programmers) was an issue.
But processes are inherently subject to change. To encode the business in the programs was to condemn the Information Technology Department to be forever trying to keep up with business changes. The advent of structured design techniques was intended to address this problem, and it did make programs easier to change—but the fundamental problem remained. Programs could adapt to change, but they could not accommodate it.
It is these three assumptions, not a general undifferentiated incompetence, that left us in this sad position.
The Solution—Version 1
So, how should we address this problem and these fallacies?
In 1970, Dr. E. F. Codd introduced the relational approach to organizing data [Codd, 1970]. Dr. Codd's theory included specific criteria to ensure that data structures remained simple and non-redundant (“normalization”1). This approach produced much simpler database structures than had been seen before. Until then, COBOL programs had arranged data hierarchically, as did many early versions of database management systems. This meant that to find a piece of data, it was necessary to start with general categories and search through progressively narrower ones.
Dr. Codd's insight was that you could do everything you needed to store data coherently with two-dimensional tables, or relations, as he called them. He based his theory on the mathematics of set theory, and from that he derived a set of rules to ensure that data were stored with the least possible redundancy. Moreover, relationships between data were defined by the data themselves, not by internal (and invisible) pointers.
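To make the idea concrete, here is a minimal sketch in modern SQL terms (which came later); the table and column names are purely illustrative, not drawn from any particular system:

```sql
-- Relationships are expressed by data values (keys), not by pointers.
CREATE TABLE customer (
    customer_id   INTEGER      PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE sales_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
    order_date  DATE    NOT NULL
);

-- The link between an order and its customer is just the shared
-- customer_id value; each customer's name is stored exactly once,
-- which is what "normalization" (described below) formalizes.
SELECT o.order_id, c.customer_name
FROM   sales_order o
JOIN   customer    c ON c.customer_id = o.customer_id;
```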
Thus, four years before publication of the three-schema architecture, Dr. Codd had already described the nature of the conceptual schema.
Shortly after Dr. Codd's paper was published, the IBM San Jose Laboratory began work on System R as a database management system based on his relational theory. At about the same time, a group of students at the University of California at Berkeley built the Interactive Graphics and Retrieval System (INGRES), and Relational Software, Inc. was developing the Oracle Database.2
Six years later, in 1976, Peter Chen introduced “entity/relationship modeling” or “data modeling” as a way of representing data structures graphically [Chen, 1976]. Two years after that, Oracle Corporation introduced the first commercial relational database management system.
These insights and techniques drew our attention to the structure of the data our programs were manipulating. And the more we looked at things, the more we realized that these data structures were more stable than our program structures could ever be. The things an enterprise is concerned with don't change very much. The processes it carries out change all the time.
Thus, our orientation changed from the programs we wrote to the data they used. The advent of this data orientation made it clear that we had been looking at the problem all wrong. Instead of addressing the problem of a company's systems horizontally, why not address them vertically, as shown in Figure 3.7?
Figure 3.7. Cut a Different Way.
This means addressing the entire problem of collecting data in the organization as a single project, utilizing all the best technology available. This might include time clocks, process-control equipment, and bar code readers.
A second project can then address reporting overall. There are excellent tools for retrieving data in many different ways and forms. These include query languages, online analytical processing tools, and other fourth-generation programming tools.
The idea is to isolate these processes from each other. This is done by redesigning the middle. The secret of this redesign is the addition of a buffer database, as shown in Figure 3.8.
Figure 3.8. The Database Approach.
Included with this buffer is a standard piece of software called a database management system. Traditional programs tended to have a lot of code concerned with reading from and writing to files. A database management system is specifically equipped to read from and write to the database. Because the mechanics of reading and writing have been standardized, the only programming required now is that to:
- define the structure of the input transactions, and the translation of input data into a form for storage
- define the structure of the data as stored
- define the nature of each query, in terms of the data as stored
An “application” is now just that required to enter or retrieve (and manipulate) data. (See Figure 3.8.)
Note that with this arrangement, the definition of the nature of the enterprise is no longer just encoded in programs. It is now reflected in the structure of the data. The task of designing the database is nothing other than the task of defining the structure of the enterprise itself.
The database management system is then charged with distinguishing the external views of data from the conceptual and physical views. Indeed, relational database management systems, via the standard language SQL, have the ability to represent “views” of data that are different from the physical tables and columns that constitute the database.
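As a sketch of how this looks in practice, continuing the hypothetical tables introduced above: an input transaction becomes a simple insert into the stored structure, and a SQL view presents consumers with an external schema that need not match the physical tables at all.

```sql
-- Input side: an incoming transaction is translated into rows in the
-- stored (conceptual) structure.
INSERT INTO customer    (customer_id, customer_name) VALUES (42, 'Framis Corp.');
INSERT INTO sales_order (order_id, customer_id, order_date)
VALUES (1001, 42, DATE '2001-03-31');

-- Output side: a view presents an external schema (one row per
-- customer per day) without changing the underlying tables.
CREATE VIEW order_summary AS
SELECT c.customer_name,
       o.order_date,
       COUNT(*) AS orders_placed
FROM   sales_order o
JOIN   customer    c ON c.customer_id = o.customer_id
GROUP  BY c.customer_name, o.order_date;

-- Consumers query the view as if it were a table.
SELECT * FROM order_summary WHERE order_date = DATE '2001-03-31';
```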
All this, then, constitutes “the database approach” to designing systems.
Data Management
A significant effect of this change in orientation is the appearance of “data management” as a function. Companies and government agencies have come to recognize that information itself constitutes an important asset, and throughout the nineties the functions “database administrator” and “data administrator” began to appear.
The meanings of these two terms have varied widely but appear to be converging on the following: the database administrator is the person responsible for maintenance of the database management software, while the data administrator (also called the information resource manager, or IRM) is the person responsible for the integrity of the data themselves. The data administrator is responsible for data models as well as for the quality of the data captured.
Professional societies such as the Data Administration Management Association (DAMA)3 have been formed to further the body of knowledge concerning data administration, and numerous books have been written on the subject.
The Solution—Version 2
The database approach to developing systems has been undertaken in various forms since the late 1970s. Relational database management systems in particular now make it possible to design a database along the lines of a conceptual schema and then provide different “views” of the database to both the creators and consumers of its data. In addition, each database management system has the ability to tune physical storage parameters independently of the table and column structure. The principles of data “normalization” (described below) provide guidance for the design of the central database, while SQL views are used to construct specific external schemas.
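For instance (again with hypothetical names), a physical tuning step such as adding an index changes nothing in the logical table, column, or view definitions that programs depend on:

```sql
-- Physical tuning only: retrieval by customer becomes faster, but the
-- logical structure of sales_order, and every view defined on it,
-- remains unchanged.
CREATE INDEX sales_order_customer_ix ON sales_order (customer_id);
```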
In spite of these advances, however, different companies have had varying degrees of success in trying to carry out this approach, and few have been able to do so completely. It turns out that there are other problems besides the ones we first identified.
The first problem relates to the first of the mistaken assumptions described earlier, the one the database approach was supposed to address: creators and consumers look at data in very different ways. (See Figure 3.9.)
Figure 3.9. Inputs vs. Outputs.
Creators of data tend to deal with it one transaction at a time. It is not impractical to analyze a transaction and figure out where each of its components fits into the database based on the conceptual schema.
Consumers of data, on the other hand, tend to want to see large sets of data, with many records taken from various parts of the underlying database. As powerful and fast as computers have become in recent years, they still are not up to the task of retrieving data as complicated as those typically requested, in the time required.
The response to these problems has been development of the central database into what has come to be called the data warehouse. This term has two meanings: First, it is a configuration consisting of a central database that reflects an enterprise's conceptual schema, plus one or more data marts, each of which represents a particular external schema—summarizing data in particular ways that will be useful to a particular data customer. The second meaning is that a data warehouse is the central database itself. It is an integrated, enterprisewide database that serves as a central point of reference. Its purpose is to store data for retrieval, not to support operations, so it will not be quite as current, and some data in it may be stored in summarized form.
It is this second meaning of the term that will be used here. A data warehouse is analogous to a discount warehouse of the sort run by Costco or Sam's Club. Here, the pretzels are stored in one part of the store, the beer is in a second part, and the video tapes are in a third part altogether. At the local convenience store, however (the mart), the pretzels, beer, and video tapes tend to be displayed close together, to encourage customers buying one to buy them all. This configuration means that most queries will not have to examine the entire data warehouse at all.
Mapping software extracts data from the central database to create these data marts. (See Figure 3.10.)
Figure 3.10. The Data Warehouse.
The data marts each tend to be organized in terms of a central fact, such as “sales”, with various dimensions for collecting groups of facts. A dimension could be identification of the fact's product and product group, its point in time, or geographical location. A collection of facts, then, is expressed in terms of the product group, time, and geography to which they apply. (“Give me all the sales for green vegetables in the Eastern Region last quarter.”) This organization of data is also called a multidimensional database.
For example, while a data warehouse (in the sense of the central database) might keep the complete structure of customer information and contracts, one multidimensional data mart might just concern the fact “sale”, accessible through the dimensions of time (day, week, month, or quarter), geography (region, state, city), customer (customer group or individual customer), and product (product group or product). In such a data mart, it would be easy to report on sales at any level of detail along any dimension.
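In SQL terms such a data mart is often laid out as a “star”: one fact table whose rows point to a handful of dimension tables. The sketch below is a hypothetical example of that shape; the names and the choice of dimensions are illustrative only.

```sql
-- Hypothetical star-schema data mart for the fact "sale".
CREATE TABLE time_dim      (time_id     INTEGER PRIMARY KEY, day DATE, month VARCHAR(7), quarter VARCHAR(7));
CREATE TABLE geography_dim (geo_id      INTEGER PRIMARY KEY, city VARCHAR(50), state VARCHAR(50), region VARCHAR(50));
CREATE TABLE customer_dim  (customer_id INTEGER PRIMARY KEY, customer_name VARCHAR(100), customer_group VARCHAR(50));
CREATE TABLE product_dim   (product_id  INTEGER PRIMARY KEY, product_name VARCHAR(50), product_group VARCHAR(50));

CREATE TABLE sale_fact (
    time_id     INTEGER REFERENCES time_dim (time_id),
    geo_id      INTEGER REFERENCES geography_dim (geo_id),
    customer_id INTEGER REFERENCES customer_dim (customer_id),
    product_id  INTEGER REFERENCES product_dim (product_id),
    sale_amount DECIMAL(12,2)
);

-- "All the sales for green vegetables in the Eastern Region last
-- quarter" becomes one constraint on each relevant dimension.
SELECT SUM(f.sale_amount) AS total_sales
FROM   sale_fact      f
JOIN   time_dim       t ON t.time_id    = f.time_id
JOIN   geography_dim  g ON g.geo_id     = f.geo_id
JOIN   product_dim    p ON p.product_id = f.product_id
WHERE  p.product_group = 'Green vegetables'
AND    g.region        = 'Eastern'
AND    t.quarter       = '2001-Q1';
```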
One problem with early incarnations of the database approach was that a typical company's management was not prepared to replace all its existing systems and base the entire company's operations on a single database. Older systems are not replaced if there seems no need to do so.
The data warehouse architecture, then, often involves feeding the central database from existing systems (legacy systems). This means additional mapping software to translate data from the terms in which they were stored there into terms appropriate for the data warehouse. These are shown on the left side of Figure 3.10.
The result of all this is a configuration that may include a set of older systems, a “mapping” from those systems to a central data warehouse, and then another “mapping” to the data marts used for user queries.
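In the simplest case such a “mapping” can itself be written as SQL. The sketch below reuses the hypothetical mart tables above and assumes an equally hypothetical warehouse_sale table in the central database; real mappings are normally run by extract/transform/load tools and involve far more cleansing than this.

```sql
-- Hypothetical warehouse-to-mart mapping: summarize detailed sales
-- rows from the central data warehouse and load them into the mart's
-- sale_fact table.
INSERT INTO sale_fact (time_id, geo_id, customer_id, product_id, sale_amount)
SELECT t.time_id, g.geo_id, c.customer_id, p.product_id,
       SUM(s.amount)
FROM   warehouse_sale s
JOIN   time_dim       t ON t.day           = s.sale_date
JOIN   geography_dim  g ON g.city          = s.city
JOIN   customer_dim   c ON c.customer_name = s.customer_name
JOIN   product_dim    p ON p.product_name  = s.product_name
GROUP  BY t.time_id, g.geo_id, c.customer_id, p.product_id;
```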
One or more of the systems which support the organization's daily operations may also be based on the conceptual schema. To the extent that this is done and the operational system comes to resemble a data warehouse, the database becomes what is called an operational data store, or ODS. The dynamics of a data warehouse and an operational data store remain very different, however. To the extent that a database is an ODS used to support applications, it will be more dynamic, with potentially less room for history. To the extent that it is used only as a warehouse to support retrievals and analysis, it may have more history and be updated less frequently. In either case, however, the underlying structure will be basically the same.
Since the data warehouse is used only for retrievals, it does not require the structures and mechanisms for enforcing business rules that are to be found in the operational data store and in other operational legacy systems.