Using Entity/Relationship and Object Models
The definition of a data model says only that boxes represent things and lines represent relationships between pairs of things. The definition doesn't say anything about how a data model is to be used. In fact, the meaning of the boxes and lines can be very different in three of the Architecture rows we've been talking about.
Figure 3.1 shows the Architecture Framework with the Data column highlighted. Three perspectives have a particular interest in data or object models:
- Business owners are interested in seeing representations of the tangible things of their business. They see things as being related to each other in complex ways. They view the data's external schema.
- Architects are interested in the underlying structure of data. For them all relationships are binary and one-to-many. The principles of normalization have been applied. They are concerned with the conceptual schema.
- Designers are interested in representations of the implications of data structure on the physical design of databases, whether they be relational, object oriented, or something else. They are interested in the logical schema that is appropriate to the technology they are using.
The graphic structure of data modeling is the same in all three rows, although, as described in Appendix B, different notations are better for the different views. Chen's notation is suitable for Row Two models, for example, since it can describe multi-way relationships. Oracle's notation works well for Row Three models, since it has minimal extra symbols for things not needed there. Object-role modeling can also be used for both Row Two and Row Three. In a relational environment, IDEF1X works well as a design technique, while object-oriented designers favor the UML. The following sections discuss this issue in more detail.
Business Owners' Views (Row Two)
The most important element of the business owners' views to capture is language. The first artifact to be created in any analysis project should be a glossary. This captures every technical, industry, and other specialized term, along with its definition. If the same word is used in multiple ways, work with people to come up with a single definition, or, if that isn't possible, document the disagreement.
Once the terms have been captured, capture the facts of the enterprise. These are constructed from the terms in the glossary, and for the most part they can be represented in data models. However they are represented, the models prepared for business owners must be clear to them, in their language. In principle, any of the modeling notations could be used here, but since these models will be discussed with nontechnical people, it is desirable to use a technique whose aesthetic qualities make it attractive and easy to understand. In Row Two, the business owner's view, the entity types (classes), attributes, and relationships (associations) represent external schemata—business objects. The notation for such a model may allow multiway relationships, many-to-many relationships, and multivalued attributes. The important thing about this model is that it represents exactly the things the people in the business see. There is no diagrammatic rigor to the model, but it is important to use language precisely and clearly.
The business owners' models are also called divergent data models, because they constitute a diverse set of entity types.
An entity type could be a PURCHASE ORDER or a VENDOR, each of which is actually a combination of entity types and relationships. Multivalued attributes are permitted, such as the “line items” contained in a PURCHASE ORDER. Many of the relationships will be “many-to-many”, and relationships may be portrayed that are not binary. (A PROFESSOR teaches a COURSE in a CLASSROOM, for example.)
The aesthetics of the diagrams are important, since business owners will have little patience for learning arcane diagramming conventions.
Architect's View (Row Three)
Models prepared for information architects (information system designers) are more disciplined. In Row Three, the architect attempts to identify underlying structures. At the syntactic level, all multiway relationships are transformed into sets of binary relationships. All many-to-many relationships between entity types are transformed into intersect entity types that represent occurrences of associations between the two entity types. All multivalued attributes are converted into additional entity types, according to the rules of “normalization” (described below). Following these disciplines ensures that the true nature of the data is really understood. In addition, at the semantic level, this model is expressed in terms of the most fundamental things of the business. What the business owners see may be merely examples of these fundamental things. For example, business owners are usually conscious of VENDORS, CUSTOMERS, EMPLOYEES, and the like. In the architectural model, these are replaced with PERSON and ORGANIZATION, along with a supertype called PARTY, where a PARTY is defined as either a person or an organization of interest to the enterprise. These are then related to each other, to contracts, and to other things, in order to show their roles as customers, vendors, employees, and so forth.
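One of the syntactic rules above, converting a multivalued attribute into an additional entity type, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute names (order_number, vendor_name, product, quantity), the function normalize_order, and the sample data are all hypothetical and do not come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseOrder:
    """Row Three entity type: one occurrence per order, no repeating group."""
    order_number: str
    vendor_name: str


@dataclass(frozen=True)
class LineItem:
    """Entity type created from the multivalued 'line items' attribute."""
    order_number: str  # refers back to the owning PurchaseOrder
    line_number: int
    product: str
    quantity: int


def normalize_order(business_view: dict) -> tuple[PurchaseOrder, list[LineItem]]:
    """Split a Row Two style order (line items embedded) into two entity types."""
    order = PurchaseOrder(business_view["order_number"], business_view["vendor_name"])
    items = [
        LineItem(order.order_number, n, item["product"], item["quantity"])
        for n, item in enumerate(business_view["line_items"], start=1)
    ]
    return order, items


# A purchase order as a business owner might describe it (hypothetical data).
po = {
    "order_number": "PO-1",
    "vendor_name": "Acme",
    "line_items": [
        {"product": "bolts", "quantity": 100},
        {"product": "nuts", "quantity": 250},
    ],
}
order, items = normalize_order(po)
```

The business owner's single PURCHASE ORDER becomes one PurchaseOrder occurrence plus one LineItem occurrence per line. The sketch does not cover the other two rules (multiway and many-to-many relationships).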
This means that entity types in architects' models may well be combinations of entity types in business owners' models. This is called a convergent data model, because the diverse entity types of the divergent models have been consolidated (“converged”) into a smaller number of more fundamental entity types.
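To make the idea of convergence concrete, here is a minimal Python sketch of the PARTY supertype. It makes a significant simplifying assumption: roles are held as a plain set of labels on each party, whereas the text derives them from relationships to contracts and other things. The class layout, the roles field, and the helper parties_in_role are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Party:
    """Supertype: a person or organization of interest to the enterprise."""
    name: str
    # Simplification: a set of role labels stands in for the relationships
    # (to contracts and so on) that would establish these roles in a real model.
    roles: set[str] = field(default_factory=set)


@dataclass
class Person(Party):
    pass


@dataclass
class Organization(Party):
    pass


def parties_in_role(parties: list[Party], role: str) -> list[str]:
    """Recover a business owner's view (all VENDORS, say) from the converged model."""
    return [p.name for p in parties if role in p.roles]


# Hypothetical sample data.
parties = [
    Organization("Acme Corp", {"vendor", "customer"}),
    Person("Pat Lee", {"employee"}),
    Person("Sam Roe", {"customer"}),
]
```

Here parties_in_role(parties, "customer") returns both an organization and a person. In the divergent model these would sit in a separate CUSTOMER entity type, and Acme Corp would appear a second time as a VENDOR.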
Again, in principle, any of the modeling notations could be used for this kind of model, but, since these models also will be discussed with non-technical people, they should be as aesthetically clean and easy to understand as possible. Remember, it is the clients who ultimately must ensure that any assumptions made while creating either model were in fact true.
In addition to the simple resolution of anomalies in the context of a particular area, the architect also reaches out to other spheres of interest, to create a model that extends beyond the immediate environment. This means, for example, that what may appear to be a one-to-many relationship in the context of one department is really a many-to-many relationship when all departments are considered.
Designer's View (Row Four)
The set of boxes and lines that constitutes a data model's notation may also be used to represent the things in the designer's view. A designer sees a data model as an expression of computer artifacts. Specifically, what you see in Row Four depends on the technology you will be using: A relational designer sees tables, columns, and foreign keys; an object-oriented designer sees classes, attributes, and associations to be navigated. What is represented here are no longer things in the business but things in the computer.
Aesthetics are not as important to the designer as they are to the architect or the business owner. The designer likes to see more details in the diagram than do audiences of either of the other two kinds of models. Hence, these models may be more cluttered and complex.
It is here that the data model and the object model are used quite differently. The logical schema the designer uses depends on the database management system and development technology being used. If the implementation is to use relational technology, the boxes in the diagram represent tables, with variations on the technique representing foreign keys and other relational structures. IDEF1X is particularly suited for this.
Alternatively, the boxes can represent an object-oriented programmer's object classes, with additions to the notation for certain object-oriented constructs, such as composition and association navigation. The UML does this well.
Note, for example, that relational developers and object-oriented programmers view relationships quite differently in the design model. A relational database relates tables by associating matching columns. That is, a relationship represents a structure that is, by definition, mutual. If A is related to B, then by definition B is related to A.
In object-oriented programming, however, a relationship represents the two navigation paths from each class to the other. Where in a relational database a relationship simply asserts that two tables could be joined together in an SQL statement, in an object-oriented environment a relationship means that program code implementing the behavior of each class will be used to implement one or both interactions between them.
Scott Ambler, for example, in a 1999 white paper, “Mapping Objects To Relational Databases” (http://www.ambysoft.com/mappingObjects.html), describes a model of ADDRESS and POSTAL AREA. In the world, and therefore in an analysis model, each ADDRESS must be in one POSTAL AREA, while each POSTAL AREA may be the location of one or more ADDRESSES, as shown in Figure 3.17.
Figure 3.17. Analyst's ERD.
Note that the relationship's roles are named in both directions. The relationship from POSTAL AREA to ADDRESS is optional, since it is possible that a POSTAL AREA may have no addresses in it. For purposes of this exercise, we have stipulated that each ADDRESS must be in one POSTAL AREA, even though in the real world it may be the case that some addresses are specified without the POSTAL AREA, especially outside the United States.
This model, then, presupposes that any ZIP code specified for an ADDRESS will be validated by comparison with the ZIP codes for POSTAL AREAS in the reference entity type. It also envisions queries in both directions: for the POSTAL AREA of an ADDRESS, and for the ADDRESSES in a POSTAL AREA.
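One relational reading of this analysis model can be sketched with Python's built-in sqlite3 module. The table and column names, the area_name column, the sample rows, and the three helper functions are all assumptions made for illustration; the source specifies none of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
    CREATE TABLE postal_area (
        zip_code  TEXT PRIMARY KEY,
        area_name TEXT NOT NULL          -- hypothetical descriptive column
    );
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        street     TEXT NOT NULL,
        zip_code   TEXT NOT NULL REFERENCES postal_area (zip_code)
    );
    INSERT INTO postal_area VALUES ('10001', 'New York'), ('90210', 'Beverly Hills');
    INSERT INTO address (street, zip_code)
        VALUES ('1 Main St', '10001'), ('2 Main St', '10001');
""")


def postal_area_of(address_id: int) -> str:
    """ADDRESS -> POSTAL AREA, by joining on the matching column."""
    row = conn.execute(
        "SELECT p.area_name FROM address a "
        "JOIN postal_area p ON p.zip_code = a.zip_code "
        "WHERE a.address_id = ?",
        (address_id,),
    ).fetchone()
    return row[0]


def addresses_in(zip_code: str) -> list[str]:
    """POSTAL AREA -> ADDRESS, using the very same join condition."""
    rows = conn.execute(
        "SELECT a.street FROM postal_area p "
        "JOIN address a ON a.zip_code = p.zip_code "
        "WHERE p.zip_code = ? ORDER BY a.address_id",
        (zip_code,),
    ).fetchall()
    return [street for (street,) in rows]


def zip_is_rejected(zip_code: str) -> bool:
    """True when the reference table refuses an address with an unknown ZIP code."""
    try:
        conn.execute(
            "INSERT INTO address (street, zip_code) VALUES (?, ?)",
            ("9 Nowhere Rd", zip_code),
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

The single foreign key supports both queries, which reflects the mutual nature of a relational relationship. The NOT NULL foreign key also expresses the stipulation that each ADDRESS must be in one POSTAL AREA, while a POSTAL AREA with no addresses is still allowed.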
In Mr. Ambler's version of the model, however, there is not a “requirement” for the relationship to be documented in both directions. In his view, POSTAL AREA only exists to provide the behavior for validating a ZIP code attached to an ADDRESS. This validation, however, is limited to what can be inferred from a format and from the “City” the ADDRESS is in. It does not extend to examination of a master list of ZIP codes, and there certainly isn't any requirement to be able to go from a POSTAL AREA to the ADDRESSES in that POSTAL AREA. For these reasons, his model is much more constrained than the analysis entity/relationship diagram. Figure 3.18 shows the entity/relationship diagram of this design version of the problem. (The arrow is an annotation only. It is not an official part of the model.)
Figure 3.18. OO Modeler's ER Model.
In Mr. Ambler's version, the model will only be navigated from ADDRESS to POSTAL AREA, so there is no need to specify the relationship from POSTAL AREA to ADDRESS. Since graphically there is no way not to specify the relationship in that direction, it is shown here as a “must be one and only one” relationship, without a meaningful name. Indeed, if the ZIP code is specified uniquely every time an ADDRESS is added, the relationship is in fact one-to-one, and it is mandatory. (Note that this means the same ZIP code can appear more than once in POSTAL AREA.)
In Mr. Ambler's example, the entity/relationship diagram in Figure 3.18 is not adequate to represent what is needed for design. UML is much more suited for this. Figure 3.19 shows the UML version of this model. Here we can see the arrow showing the only navigation direction. We can also see the names of the programs that implement each entity type's behavior. (OK, cardinality from POSTAL AREA to ADDRESS is shown, but your author doesn't know why.) The key program for this discussion is “validate” in POSTAL AREA. This is a program that will check the format of the code and check for consistency between the code and the state specified. (In the U.S. the first two digits of a ZIP code determine the state where the POSTAL AREA is located.)
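The one-way navigation and the “validate” behavior described above might look roughly like the following Python sketch. It is not Mr. Ambler's code. The class and attribute names, the regular expression, and the three-entry prefix table are assumptions made for illustration only; a real lookup would be far larger.

```python
import re
from dataclasses import dataclass

# Hypothetical and deliberately tiny: maps the first two digits of a ZIP code
# to a state. Prefixes missing from this table are simply rejected here.
STATE_BY_ZIP_PREFIX = {"10": "NY", "60": "IL", "90": "CA"}


@dataclass
class PostalArea:
    zip_code: str

    def validate(self, state: str) -> bool:
        """Check the code's format, then its consistency with the given state."""
        if not re.fullmatch(r"\d{5}(-\d{4})?", self.zip_code):
            return False
        return STATE_BY_ZIP_PREFIX.get(self.zip_code[:2]) == state


@dataclass
class Address:
    street: str
    state: str
    postal_area: PostalArea  # the only navigation path: ADDRESS -> POSTAL AREA

    def is_valid(self) -> bool:
        """Delegate ZIP validation to the POSTAL AREA this address refers to."""
        return self.postal_area.validate(self.state)
```

For example, Address("1 Main St", "NY", PostalArea("10001")).is_valid() is true, while the same address with state "CA" is not. PostalArea holds no reference back to its addresses, so there is no way to ask it which ADDRESSES it contains, and nothing is compared against a master list of ZIP codes.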
Figure 3.19. OO Modeler's UML Model.
This is clearly a design model. The notation has details (attribute formats, for example, and symbols qualifying attributes and behaviors) that are of interest to designers, but that are not of interest during analysis. It does not have relationship role names. In addition, as we have seen, its content differs from the original analysis model because of economic evaluations that were performed to constrain the design. The company cannot afford to buy a ZIP code master file (from either a financial or a logistical point of view), and it currently has no interest in locating addresses by ZIP code, so the design is modified to reflect that. Mr. Ambler says that this constrained approach is what is “required” by the business, but in fact it is only required because circumstances forced the design to be something less than what is envisioned by the conceptual model. After all, applying economics to the conceptual model is what design is all about. The validation in this design is weaker than might be possible, but the company can decide to accept that. This is the sort of economic trade-off that designers make all the time.
Note, however, that circumstances may change in the future, and the ZIP code file may be deemed a good idea. The company may decide at that time that it wants to see all the addresses in a particular POSTAL AREA. Then it would be valuable to have constructed the conceptual model correctly, so you can see how the design model varied from it, and what now must be changed.