- Views of Data
- A Brief History of Data Architecture
- Advanced Data Management·Meta-data
- Graphics·Data Modeling
- Using Entity/Relationship and Object Models
- Normalization
- Data Modeling Conventions
- Entity/Relationship Model Validation
- The Requirements Analysis Deliverable·Column One
- Data and the Other Columns
- Conclusion
Graphics—Data Modeling
A Short History
Computer programs are fundamentally abstract products. In the mid-1970s people began to realize that these abstractions could benefit from more graphic representations. Flow charts led to data flow diagrams on the process side. And in 1976, entity/relationship modeling was invented to represent data structures graphically.
Entity/Relational Modeling
Peter Chen first introduced a graphic approach to relational design with entity/relationship modeling (a kind of data modeling) in 1976 [Chen 1976–1977]. Based on Mr. Codd's idea of associating simple groups of data, entity/relationship modeling used a box to represent a relation, and a diamond and lines to represent a relationship between relations. This simple idea has revolutionized the way we represent data structures. It began essentially as a graphic technique for modeling relational data, but it subsequently evolved to representation of a particular nonrelational structure as well (“super-type” and “sub-types”, described below).
While originally data modeling was supposed to depict relations, it quickly became clear that the boxes really represented the things in a business about which it wished to hold information. They represented entity types. An entity type is the definition of something of significance to an organization, such as “person”, “product”, or “activity”. An entity type has attributes and relationships with other entity types.
Indeed, data modeling describes the structure of a business itself.
Thus, a data model reflects the structure of an enterprise as well as the potential design of a database that would similarly reflect that structure of the enterprise. And it turns out that while a company's processes may change, its fundamental (data) structure rarely does. Base a system design on the structure of a company's data and you have a system that promises to last more than a few months.
Mr. Chen's approach to entity/relationship modeling was only a first version, however, and many people since then have tried to improve upon it. A veritable plethora of data modeling notations have emerged.
Clive Finkelstein invented a different data modeling notation to use as the basis for his idea of information engineering. Not just a new modeling technique, information engineering is a comprehensive approach to the whole process of system development. It is based both on a data-centric approach to analysis and on the system development life cycle (described in Chapter 2 of this book). It portrays the development process as progressing from strategic planning and requirements analysis through design, construction, and implementation. Mr. Finkelstein and James Martin first published this in 1981 in the article “Information Engineering” in the Savant Institute's Technical Report. Later Mr. Martin popularized this approach through various books, and Mr. Finkelstein finally published his version in 1989 as An Introduction to Information Engineering: From Strategic Planning to Information Systems.
Messrs. Finkelstein and Martin added to the relationally oriented data model the concept of sub-type and super-type. This concept is derived from the idea of inheritance, first identified for object-oriented programming in the 1960s. It is a representation of the fact that occurrences of a thing of significance may themselves be categorized. An ORDER, for example, may be either a SALES ORDER, a PURCHASE ORDER, or a WORK ORDER. This is not a structure that can be directly represented in a relational database without serious translation. But it is a powerful data modeling concept for representing things in the business.
Subsequently, several other data modeling notations were developed. Among these were IDEF1X, primarily used by the United States Defense Department [Bruce, 1992], and one developed by the British consulting firm CACI. The latter was further developed by Richard Barker and the Oracle Corporation and is also used in Europe as part of the Structured Systems Analysis and Design Method (SSADM). Among other things, this modeling method introduced a discipline for naming relationships—a discipline that makes it possible to read relationship names as simple natural language sentences.
Object Modeling
While all this was going on in the world of business requirements analysis, programmers also discovered that organizing their programs around data had great advantages. Since the mid 1960s, object-oriented programming has been taking over the world of writing programs. Where programs were originally organized around what they did, with data being attached where necessary, the object-oriented programmers organized their work around the data—or rather, they organized it in terms of the objects the data described. This turned out to be a particularly powerful way to look at the problem—especially in the realm of real-time systems. It burst upon the general data processing scene in a big way in the 1980s, when the advent of graphic user interfaces introduced object-oriented programmers to the world of commercial applications. They realized that the problem of defining requirements in this world could benefit from their insights into the world of objects. Without appreciating that commercial systems analysts themselves had already discovered the modeling of “things of significance to the business”, they introduced the idea of object models.
Thus, while the systems analysts were becoming data oriented through information engineering, the programmers were becoming data oriented through object-oriented programming.
In 1988 Sally Shlaer and Stephen Mellor wrote Object-Oriented Systems Analysis: Modeling the World in Data, and thus introduced object-oriented analysis to the world at large. They acknowledged that information engineering had developed a modeling technique that was singularly compatible with the object-oriented view of things. They invented their own notation for it, however (adding to the four or five that already existed), thus introducing the notion of “object modeling” to the world of commercial applications.
In fact, the symbols in Ms. Shlaer's and Mr. Mellor's books corresponded exactly to those in information engineering. There were entity types (now called object classes) and relationships (called associations). The relationships had a single arrow when referring to a single occurrence of a class and a double arrow when referring to multiple occurrences of a class. Each relationship was assumed to be mandatory, unless marked with a “C” for “conditional”.
Then, in 1991, James Rumbaugh, Michael Blaha, William Premerlani, Frederick Eddy, and William Lorensen published Object-Oriented Modeling and Design. This book presented another object modeling notation, along with a methodology called the “Object Modeling Technique”, or OMT. The methodology included variations on the information engineering concepts of analysis, system design, “object design”, and implementation. It modified the information engineering approach to the extent that it was much more tolerant of iteration in the system development life cycle.
After 1991, several other authors tried their hands at developing object-modeling techniques. Ed Yourdon and Peter Coad wrote Object-Oriented Analysis in 1990. David Embley, Barry Kurtz, and Scott N. Woodfield wrote Object-Oriented Systems Analysis: A Model-Driven Approach in 1992. So now, in addition to the various entity/relationship modeling notations available, there were as many or more under the rubric of object modeling.
In 1997 the experts of the object world got together and devised the Unified Modeling Language (the UML), to replace at least all of the object-modeling notations [OMG, 1998].
Where it was concerned only with business objects (tangible things seen by people in the enterprise or intangible things commonly understood by people in the enterprise), object modeling essentially represented the same things as the data modeling techniques available at the time. It had the advantage, however, of being from the “object” world, which made object-oriented developers more comfortable with its ideas. As it happened, the ERD concept of inheritance from super-type to sub-types had been developed in the context of object-oriented programming in the 1960s.
This doesn't mean that object modeling is not useful to support object-oriented design specifically. By portraying computer objects an object model can be very helpful in clarifying the implications of particular designs. And the model of an object oriented design may be very different from the corresponding model of relational design.
For example, object-oriented associations are treated differently from relational relationships in design. Where a relationship implemented in a relational database remains structural, consisting of a foreign key, in an object-oriented program it is part of a class's behavior. A piece of program code is required to implement the navigation from each class to each other class. Separate pieces of code are required for the two directions of a relationship.
Object-oriented programming does combine objects with their behavior, bringing together data and process in a very effective way. In the context of object-oriented programming, behavior describes the program which implements the behavior of an object along with the description of the object itself. Mr. Rumbaugh et al. extended this to allow for describing the behavior of business object classes as well. This did not, however, affect the structural aspects of the technique.
While the idea of incorporating behavior in analysis data models is appealing, it turns out to be problematic. Where, in a program, behavior is always described by program code, the behavior of real-world objects is much more complex. Since the objects in object-oriented programming are themselves pieces of code, it is reasonable to package with them the code that describes their behavior. Objects in the world, however, that are the subject of requirements analysis behave in ways that are far more varied and complex than can be described by a bit of pseudocode. Describing the complete behavior of a person, for example, has taxed the skills of novelists for generations.
The best way to describe business object behavior appears to be entity life histories, a technique described in Chapter 7.
The net effect of all this, even with the consolidation under the UML, is that there are still many ways to model the structure of data. Appendix B presents the most significant notations in detail and compares them.
Regardless of the symbols used, data modeling during analysis (whether of the object or entity/relationship kind) is intended to do one thing: describe the things about which an organization wishes to collect data, along with the relationships among them. For this reason, all of the commonly used systems of notation fundamentally are convertible from one to another. The major differences among them are aesthetic, although some make distinctions that others do not, and some lack symbols to represent all situations.
Object-Role Modeling (ORM)
Another approach to modeling data was developed by G. M. Nijssen in the 1980s. This is significantly different in its approach from either entity/relationship or object modeling. It was originally called NIAM, an acronym for “Nijssen's Information Analysis Methodology”. More recently, since G. M. Nijssen was only one of many people involved in the full development of the method, it has been renamed “object-role modeling”, or ORM. ORM takes a different approach from the other methods described here. Rather than representing entity types as analogs of relational tables, it shows relationships (“roles” in ORM parlance) to be such analogs. Like Mr. Barker's notation, it makes extensive use of language in making the models accessible to the public, but unlike any of the other modeling techniques, it has much greater capacity to describe business rules and constraints.
With ORM, it is difficult to describe entity types independently from relationships. The philosophy behind the language is that it describes “facts,” where a fact is a combination of ORM objects: entity types, value types, and roles.
In ORM, an entity type is portrayed by an ellipse (often a circle, actually) containing its name. An ellipse can also represent a value type, which is effectively a domain. A value type describes a kind of data, like “date” or “amount”, and can be related to entity types just like other entity types.
Relationships between pairs of objects are shown as pairs of roles. A role is the half of a relationship reading in one direction. In other words, entity type 1 plays a role with respect to entity type 2. Entity type 2 plays a complementary role with respect to entity type 1. A fact is the playing of a role by an entity type. Figure 3.11 shows that each relationship between entity types represents two facts. For example, each Event must be an example of one and only one Event Type, and each Event Type must be embodied in one or more Events.
Figure 3.11. An ORM Model.
In fact what would be an “attribute” of an entity type in an entity/relationship diagram, in ORM is simply a role played in a relationship. It is intentionally not clear initially whether the attribute is pointing to a value type or another entity type.
The dot where a relationship is attached to an entity type means that for each occurrence of the entity type, there must be at least one occurrence of the role involved. The double-headed arrow over one or more roles means that there can be no more than one occurrence of the roles so marked; they are unique. For each occurrence of the entity type, there can be no more than one occurrence of the role.
Object role modeling is described both in Appendix B and in Chapter 8 of this book. Chapter 8 describes ORM's approach to business rules. ORM is fully documented in Terry Halpin's 2000 book, Information Modeling and Relational Databases.
For example, the dot on Event makes it mandatory that each Event must be an example of at least one Event Type. The double-headed arrow over “an example of” means that there can be no more than one occurrence of this role for each Event. Absence of a dot next to Event Type makes it optional whether there are any events for that Event Type, and the absence of a double-headed arrow over “embodied in” means that the role can occur any number of times for an Event Type.
When it is related to another entity type, the role it takes is an attribute of the first entity type. Figure 3.11 shows an example: Event and event type are entity types. “Event description” and “event type description” appear to be attributes, but this is left intentionally ambiguous. Indeed, an attribute is treated as an object, just like an entity type. Event Type Description appears here to be a value type (domain). It is conceivable—albeit unlikely—that “Event Type Description” could turn out to have attributes of its own, thereby becoming an entity type. If this were an entity/relationship diagram, and “event type description” were initially shown as an attribute of Event, later, when it turned out that it had an attribute of its own, the model would have to be restructured to make Event Type an entity type. Here, it doesn't matter. The relationship between Event Type and Event Type Description is independent of whether or not there are attributes of Event Type Description.
An entity type label (an attribute which is a sole identifier) may be shown with a dashed ellipses around a value type, although as a shorthand, it also may be shown within the entity type ellipse in parentheses, below the entity type name. In the example, “(ID)” is the identifying attribute for event, and “(Code)” is the identifying attribute of event type.
Relationships not only connect entity types to each other but also connect value types to entity types. ORM is unique in being able to graphically show the optionality and cardinality of attributes. That is, is the attribute mandatory? Can it take more than one value?
Where most methods portray entity types in terms that allow them to be translated into relational tables, ORM portrays the relationships so that they can be converted to tables. That is, the two parts (or more) of the relationship become columns in a “relation” (table). In effect, these are the foreign keys to the two entity types. Attributes of one or more of the related entity types also then become part of a generated table.
In Figure 3.11, for example, a table would be created from the an example of / embodied in relationship, with columns for “Event (code)” and “Event Type (code)”. Another table would have the columns “Event Code” and “Event Description”, and yet another would have “Event Type Code” and “Event Type Description”. This is shown in Tables 3.1.
Table 3.1. Relationship Tables
ORM represents a unique way to create conceptual data models for Row Three of the Architecture Framework.
How to Draw a Data Model
As stated above, Appendix B contains a comparison of the various data and object modeling techniques currently available. It compares their syntax, aesthetics, and appropriateness to each of the audiences described here.
Because this book is concerned with requirements analysis—the translation of a business owner's view to an architect's view—the data modeling technique used here will be the one that seems most suitable for presenting to an audience of business owners. It is graphically the simplest, with the fewest different kinds of symbols to learn.
The technique was developed by the British consulting company CACI and then made part of Oracle Corporation's approach to modeling. It was brought into its final form and publicized by Richard Barker in his 1990 book, CASE Method: Entity Relationship Modelling. It has since been made part of the European methodology, “Structured Systems Analysis and Design Method” (SSADM). (More information about SSADM can be found in Malcolm Eva's 1994 book, SSADM Version 4: A User's Guide Second Edition.)
In this notation, two major types of symbols are used in an entity/relationship diagram:
-
A rectangular box with rounded corners, which represents an entity type
-
A line, each half of which may be dashed or solid, which represents a relationship between two entity types
Other elements of the diagram may add additional information.
Entity Types
An entity type is a person, place, event, thing, or concept that has characteristics of interest to the business and is uniquely identifiable. It is something about which data are kept.
Examples of four entity types are shown in Figure 3.12.
Figure 3.12. Entity Types.
Note that, as originally defined, what you see here are really entity types or entity classes. Each is the definition of a TIMESHEET ENTRY, a PERSON, an ORGANIZATION, or an ACTIVITY. In the original terminology, a particular person, organization, timesheet entry, or activity is an entity. That is, originally, the definition was called an “entity type”, and the occurrence was called an “entity”, but the language has since been corrupted. Now entity often refers to the definition, and an instance of the thing is called an occurrence. This book will retain the original definitions of “entity type” for the definition and “entity” for the occurrence. In the object-oriented world, the occurrence is called an object, while the entity is called an object class (or simply class.)
Sub-types and Super-types
When the occurrences of an entity type fall into two or more subcategories, the subcategories can themselves become entity types which are then called sub-types of the first entity type. The first entity type is called a super-type. For example, in Figure 3.13, PERSON and ORGANIZATION are examples of the larger class called “PARTY” (as in “party” to a contract). That is, each PARTY (a super-type) must be either a PERSON or an ORGANIZATION (a sub-type), each PERSON must be a PARTY, and each ORGANIZATION must be a PARTY.
Figure 3.13. Sub-types and Super-types.
Note that there is a rule with this notation that each occurrence of the super-type may be an occurrence of only one sub-type, and it must be an occurrence of one of them.
Relationships
A relationship is a pair of named associations between entity types. In an entity/relationship diagram, a relationship is represented by drawing a line between two entity types.
The cardinality constraint—the assertion that an occurrences of an entity type may be related to no more than a specified number of occurrences of another entity type—is represented on a diagram by the presence or absence of a “crow's foot” symbol next to the second entity type. If there is no crow's foot, an occurrence of the far entity type may be related to no more than one occurrence of the adjacent entity type. If there is a crow's foot, there is no limit to the number of occurrences to which the occurrence of the far entity type may be related.
The optionality constraint—the assertion that an occurrence of one entity type must be related to at least one occurrence of another entity type—is represented by the half-line next to the first entity type being solid or dashed.
In most notations, there are no real rules about the naming of relationships. People generally use verbs, but with no particular system regarding the way they use them. Cardinality and optionality are not expressed in the resulting sentences, except by saying such things as, for example, “A PERSON has zero, one, or many ASSIGNMENTS”. The Barker notation, however, uses a convention that both adds additional discipline to the names and makes them more readable as normal English sentences.
Relationship sentences may be constructed from the symbols in the Barker technique as follows:
Each
<first entity type name>
must be (solid line from first entity type)
(or)
may be (dashed line from first entity type)
<relationship name>
one and only one (single line into second entity type)
(or)
one or more ("crows foot" into second entity type)
<second entity type name>,
Examples of two relationships are shown in Figure 3.14.
Figure 3.14. Relationships.
The top relationship in Figure 3.14 is represented by the following two sentences:
-
Each TIMESHEET ENTRY must be by one and only one PERSON.
-
Each PERSON may be the source of one or more TIMESHEETS.
The lower relationship in Figure 3.14 is represented by the following two sentences:
-
Each TIMESHEET ENTRY must be for one and only one ACTIVITY.
-
Each ACTIVITY may be charged with one or more TIMESHEET ENTRIES.
Attributes
An attribute is the definition of a characteristic, quality, or property of an entity type. It describes an entity type. In an entity/relationship diagram, an attribute of an entity type names and defines a characteristic, while an occurrence of an entity type provides a discrete value for the attribute.
Examples of attributes may be shown as text in the entity type boxes in Figure 3.15. Note that any attribute of (or relationship to) a super-type is inherited by each of the sub-types. In the example, “Party ID”, an attribute of PARTY, is also an attribute of PERSON and of ORGANIZATION. This does not go the other way, however. “Family Name” is an attribute of PERSON only and therefore is not an attribute of either PARTY or ORGANIZATION. Similarly, TIMESHEET ENTRY is related to PERSON only, and not to ORGANIZATION.
Figure 3.15. Attributes.
When presenting data models, you may chose not to show the attributes. They tend to clutter the diagram unnecessarily if you are presenting a model for purposes of discussing structure only. On the other hand, some would like to see attributes to clarify the meaning of each entity type.
Note that on this diagram, attributes are preceded by different symbols.
-
The open circle () means the attribute is optional—it may or may not be given a value for a particular instance of the entity type.
-
The asterisk (*) means that it is mandatory.
-
The pound sign (#) means that it is part of the “unique identifier” of the entity type. That is, the value of that attribute at least partially identifies different occurrences of the entity type—distinguishing them from each other.
The entity types PARTY and ACTIVITY are examples of reference entity types or independent entity types. Occurrences of them do not require occurrences of any other entity type. The entity type TIMESHEET ENTRY is an example of a dependent entity type, because an occurrence of it can only exist if it is related to occurrences of PERSON and ACTIVITY. 7The entity type TIMESHEET ENTRY is also an example of an intersect entity type, sometimes called an associative entity type. This is because its sole purpose for existing is to relate other entity types to one another.
PARTY is also an example of a reference entity type. A reference entity type is an entity type that usually refers to something tangible in the enterprise (in this case, a PERSON or an ORGANIZATION). It may also refer to a quality (like COLOR) that is used to describe another entity type, or to a classification for other entity types, such as ACTIVITY TYPE or CONTRACT TYPE. Usually, a reference entity type is also an independent entity type. By contrast, a transaction entity type (always a dependent entity type and usually an associative entity type) refers to something that occurred at a particular time.
All independent entity types are reference entity types, while nearly all reference entity types are independent. All dependent entity types are transaction entity types, and all transaction entity types are dependent on reference entity types. It is possible for a reference entity type to be dependent on another reference entity type.
Some of the attributes in Figure 3.15 are in parentheses. This means that the value for that attribute is computed from other attributes. For example, “(Value)” in TIMESHEET ENTRY is computed as “Hours” in TIMESHEET ENTRY, times the “Charge Rate” in PERSON. “(Labor Cost)” in ACTIVITY is the sum of “(Value)” across all the TIMESHEET ENTRIES that an ACTIVITY is charged with.
Note that any notation may be used for derived attributes, but it is constrained by the process that will be used to convert attributes in entity types to columns in tables. Parentheses are not usually permitted in table names. If this conversion will be done automatically, the convention used here will not work. In that case, simply precede the attribute name with a “c”.
Note also that the definition of the calculation does not show on the model. This must be documented behind the scenes.
Unique Identifiers
The first rule for specifying an entity type is that each of its occurrences must be unique. Duplicates are not permitted. An occurrence is made unique through some combination of attributes and relationships. By the end of the requirements analysis project, each entity type in the model must be assigned at least one unique identifier.
In the sample model in Figure 3.16, each occurrence of PERSON is made unique by a value for “ID”. This is shown by the octothorpe (#)4 in front of “ID” Similarly, COURSES are uniquely identified by values for “Course number”.
Figure 3.16. Unique Identifiers.
A COURSE OFFERING, on the other hand, requires a value for “Offering date” plus identification of the COURSE that the COURSE OFFERING is an occurrence of. This is shown by the octothorpe in front of “Offering date” and the short line across the occurrence of relationship. A CLASS ENROLLMENT, meanwhile, can only be identified by identifying the PERSON enrolling and the COURSE OFFERING enrolled in. This is shown by lines across both relationships.
In the original relational theory, each entity type's unique identifier should be a combination of natural attributes. In practice, however, it is difficult to assert that any attribute is absolutely immutable—a requirement if it is to be part of a primary key. For this reason, it is common to assign a surrogate key at least for reference entity types. This is a sequence number that has no intrinsic meaning. It is simply incremented for each new occurrence to ensure that that occurrence is unique. In Figure 3.16, “ID” is a surrogate key to PERSON.
If all reference entity types (those that have no mandatory relationships to any other entity type) have surrogate keys, it is still reasonable to let natural keys serve as identifiers of the relationships to such reference entity types. It is not necessary to have a surrogate key, unless it is important to allow multiple occurrences of each combination the entity type describes. In Figure 3.16, for example, using the two relationships to identify occurrences of CLASS ENROLLMENT means that no PERSON may be enrolled in the same COURSE OFFERING more than once. If you want to allow such duplication, it is necessary to add a “Sequence number” as part of the identifier.