- Views of Data
- A Brief History of Data Architecture
- Advanced Data Management·Meta-data
- Graphics·Data Modeling
- Using Entity/Relationship and Object Models
- Normalization
- Data Modeling Conventions
- Entity/Relationship Model Validation
- The Requirements Analysis Deliverable·Column One
- Data and the Other Columns
- Conclusion
Normalization
Dr. Codd's relational theory includes a specific set of rules for organizing data, if they are to be considered relational. These rules are expressed in terms of relations (two dimensional arrays of values), tuples (rows) and attributes (the definitions of columns in a relation). Databases supporting the relational theory express relations as tables, attributes as columns, and tuples as rows in those tables. These rules for organizing data are observed in the architect's version of data modeling, although modeling can reveal things not evident from relational theory alone. (See discussion on page 104.)
Dr. Codd expressed the rules as a series of progressively more restrictive constraints on the structure of data. The idea is that if all of the constraints are applied, the data will have the least redundancy. Every element (unless it was used as a key) would appear once and only once in the structure.
The constraints are described in terms of functional dependence. C. J. Date gives a mathematical definition of this [Date, 1986, p. 365], which amounts to the fact that a column is “functionally dependent” on another column, if, given a value of the second column, you will always have the same value for the first column. The column the others are dependent upon is called the primary key.
The process of normalization goes through a succession of “normal forms”.6 In each case, for data to be in that normal form, they must be in the preceding one and then meet at least one additional constraint. The first four forms (first through third, and Boyce Codd) apply structural constraints that allow one to assert compliance absolutely. Fourth and fifth normal form, on the other hand, assert that if the meanings of the data are of a certain kind, and you draw the model in a way appropriate to that meaning, then the constraints are satisfied.
Two points should be made about normalization:
First, its purpose is to create the simplest, least redundant organization of data possible. For this reason, it is a Row Three process. Designers in Row Four often depart from the normalized structure.
Second, the process looks at actual sets of values. It must assume that the sample data shown completely represent the relationships that exist in the things the data describe. In the real world, alas, you cannot assume that there is anything to prevent someone from introducing a new row that invalidates the whole analysis. This means that it is critical for any relationships identified to be verified by a subject matter expert.
Before Normal Forms
For a set of data to be a relation at all requires certain things:
-
It must be arranged in tuples of uniform length—that is, where every tuple has the same number of attributes.
-
There may be no meaning attributed to either the sequence of tuples or the sequence of attributes.
-
Each tuple must be uniquely identifiable by one or more attributes, called the primary key.
In other words, the data must be arranged in simple, two-dimensional forms. For example, see Table 3.2. The data describe various environmental chemical tests conducted on samples taken in someone's back yard. Each tuple represents the set of tests conducted on a particular sample. In this case, the primary key is a combination of the “Sample Date” and the “Mat'l Code” describing the material sampled. This example assumes that the same kind of material will not be sampled twice on the same day.
Table 3.2. Before Normal Forms
First Normal Form
First normal form adds an additional constraint:
-
Every attribute may have only one value for a tuple in a relation.
Note that in Table 3.2, the “Date”, “#”, “Test” (type and description), and “pH” for each test are repeated for as many tests as exist. This has several effects:
First, it is impossible to specify a uniform record length for the tuples. Ten percent of the tuples might have eight tests, while 90 percent might have two only.
Second, to find out the number of times a particular kind of test was run, it is necessary to examine not only every tuple, but also all the attributes in that tuple.
First normal form essentially eliminates the phenomenon of repeating groups. A repeating group is a set of attributes that can take multiple values for a given occurrence of an entity type. This is not permitted for a relation. There will be only one value for “Test Date” in any tuple of the relation. This produces a relation that looks like Table 3.3.
In this relation, notice that there is one tuple for each occurrence of a test. Hence we have named the relation “TEST”. This means that the primary key is now a combination of not only the “Sample Date” and “Mat'l Code” that we saw before, but also includes the “Test #”.
Table 3.3. First Normal Form
Second Normal Form
In Table 3.3 the “Material Description” value repeats every time the same material is sampled. This has several effects:
First, the repeated description uses unnecessary space.
Second, multiple entry of the same fact permit errors and inconsistencies to creep in. Note the two occurrences of the misspelling “watter”.
Third, if the description is changed for any reason, it must be changed for all tests of that material.
Fourth, you cannot assert that a material exists until it has been the subject of a test.
Note that all of these problems are derived from the fact that “Material Description” is functionally dependent on only part of the primary key, “Material Code”.
To address this, another constraint is added, constituting Second Normal Form:
-
Each attribute must be dependent on the entire primary key.
Addressing this issue yields the structure shown in Table 3.4. In this, a separate relation was created to correlate “Material Code” and “Material Description”. Since the value of “Material Description” is dependent only on the value of “Material Code”, it should not be in the TEST relation but in a relation by itself, here called MATERIAL. Note that “Material Code” is still in TEST, since the test data are still dependent on the material being tested. But data about each material need not be. That's what the MATERIAL relation is for.
Table 3.4. Second Normal Form
IMPORTANT NOTE: As stated previously, this analysis presumes that the sample data exhaustively define the relationships. Unless someone who knows the business can verify it, you have no assurance that a new tuple will not invalidate the whole analysis. In the above example, if material code 3256 could also describe “Radiator water” (or if “watter” is in fact a special kind of material), the second normal form step shown here wouldn't work.
Third Normal Form
In Table 3.4, the primary key of TEST is still “Sample Date”, “Material Code”, and “Test #”. All of the attributes are functionally dependent on those attributes for their values. But notice something. Every time “Test Type” has the value “A”, “Test Desc'n” has the value “strips”. Every time “Test Type” has the value “B”, “Test Desc'n” has the value “chem”. In other words, in addition to being dependent on the primary key, “Test Desc'n” is dependent on another non-key attribute.
This means that the value “strips” must be entered every time a “Test Type” has the value “A”. This causes many of the same problems described above for first normal form: Space is wasted; every entry must be spelled identically, and it may well not be; if the description of a test type is changed, it must be changed for every occurrence of the type; a test type may not be defined until it is used in a test. In other words, “Test Description” is dependent on “Test Type”, which is not part of the primary key.
This leads to specification of an additional constraint, to bring us to Third Normal Form:
-
Each attribute must be dependent only on the primary key.
This leads to the structure shown in Table 3.5. Since “Test Description” is dependent on “Test Type”, regardless of the actual tests conducted, these attributes are pulled out to a separate relation, TEST TYPE.
The “Test type” code still appears in TEST, since the type of test conducted is still an important part of the information about a test.
Table 3.5. Third Normal Form
Boyce/Codd Normal Form
To get to this point we've established that each attribute/attribute must be a function of “the key, the whole key, and nothing but the key, so help me Codd.”7 All of the forms so far have been about relationships between the key and non-key attributes.
But it may be the case that there are dependencies within parts of a compound key. Table 3.6 is a variation on our previous example that concerns where testing is to take place. Instead of using “Test Type” as in the previous examples, the primary key here is composed of four attributes: “Material Type”, “Test Type”, (Test) “Location”, and “Date”.
(As a simplification for this example, “Sample Date” and “Test Date” can be collapsed into “Date” by assuming that the same “Test Date” does not occur for two “Sample Dates”.)
The value of the test information (“pH”) is determined by what is being tested, what kind of test is conducted, where it is conducted, and when it is conducted.
Closer examination, however, suggests that every time “Location” is “yard”, the “Test Type” conducted there is “A”. Similarly, each time “Location” is “Lab”, the “Test Type is “B”, and each time “Location” is “Office”, the “Test Type” is “C”. That is, the test type is determined by the location where the test is conducted. This means that if the Location name changed, it would have to be changed for every occurrence of the corresponding “Test Type”, or, alternatively, if a Test Type Code changed, it would have to be changed for every occurrence of the corresponding Location.
In other words, “Location” and “Test Type”, two parts of the primary key, are dependent on each other.
Table 3.6. Another Example
If the relationships indicated by the data are true, there is an additional relation required: one that shows the relationships between “Test Type” and “Location”. This in turn implies that there are in fact two different candidate primary keys for TEST: “Material Type”/ “Test Type”/ “Date” and “Material Type”/ “Location”/ “Date” (see Table 3.7). Use of either of these, plus creation of the LOCATION relation, puts the configuration in “Boyce-Codd Normal Form”.
This is embodied in the following constraint:
-
No part of the primary key may be dependent on another part of the primary key.
Table 3.7. Boyce-Codd Normal Form
Fourth Normal Form
The next two kinds of normal form are more subtle, since, instead of having a single solution, they are each satisfied in one of two ways, depending on the nature of the data being described.
Table 3.8 shows a different variation on our problem. This relation describes the possible combinations of material type, test type, tester, and location of test. Each tuple requires all four attributes to be identified. Here we see that each material can be given a test type by someone (“Conducted By”) at a particular location (“Conducted At”). (We are not speaking here of actual tests, but simply of provision for who can conduct them and where.)
In the example Sam can conduct test type A at the Lab, in the Office, and in the Yard. Shirley can do so as well. In addition, Sam can conduct test type B in each location. Notice that if a new tester is added, a tuple must be created for each of the locations where the tester is qualified.
It appears, though, that each tester for a given test type is qualified for all possible locations. If that is the case, it is an unnecessary amount of extra work to itemize these.
Table 3.8. Yet Another Example
What this reveals is that there are independent sets of dependencies within the primary key. Each “Material Code”/ “Test Type” combination can be conducted at one of the three “conducted at” locations shown, and independently of this, each “Material Code”/“Test Type” combination can be “conducted by Sam or Shirley.” This calls for two relations instead of one.
The Fourth Normal Form constraint is:
-
There may be no independent sets of dependencies within a primary key.
If this is required and it is done, the result is in Fourth Normal Form, as shown in Table 3.9.
Note, by the way, that if Shirley is not permitted to conduct test A on Material 3256 in the yard, or Sam is not permitted to test Material 3256 in the office, then the “Conducted By” and “Conducted At” attributes are not independent. In that case, to be in Fourth Normal Form, the data must be left arranged as in Table 3.8.
Table 3.9. Fourth Normal Form
Fifth Normal Form
Our next example examines the relationships among Material Code, Test Code, and Location. Tables 3.10 include basic reference relations for MATERIAL, TEST TYPE, and LOCATION. Tables are also included for the intersections of each pair of these reference relations: MATERIAL TO BE TESTED, LOCATION OF TEST, and LOCATION OF MATERIAL. It is also possible to create a relation that contains permutations of all three reference relations—MATERIAL TO BE TESTED IN A LOCATION.
A structure is in Fifth Normal Form if these relations are properly configured.
Table 3.10. Still Another Example
Specifically, if the set of tuples in MATERIAL TO BE TESTED IN A LOCATION represents all possible permutations of the tuples in MATERIAL TO BE TESTED, LOCATION OF TEST, and LOCATION OF MATERIAL, then that relation is not needed.
In the example in Table 3.11:
-
Material Type 3256 can only be subject to Test Types C and A, while Material Type 4287 can be subject to Test Type B.
-
Material Type 3256 can only be collected at Location Types 12 and 14. Furthermore, Material Type 4287 can only be conducted at Location Type 18.
-
Test Type A can be conducted at Locations 12 and 14, but not 18, while Test Type B can only be conducted at Location 18.
If the three associative relations (Location of Material, Material to be Tested, and Location of Test) determine what can be in MATERIAL TO BE TESTED IN A LOCATION, then only the tuples “C 3256 14”, “A 3256 12”, and “B 4287 18” are permitted. Indeed, these are the tuples contained in MATERIAL TO BE TESTED IN A LOCATION, so that relation is redundant and must be eliminated to put the configuration in Fifth Normal Form.
Table 3.11. Fifth Normal Form—Case 1
On the other hand, if there are combinations in MATERIAL TO BE TESTED IN A LOCATION that are not reflected in one or more of the other intersect relations, or vice versa, then MATERIAL TO BE TESTED IN A LOCATION is required, as shown in Table 3.12.
Here MATERIAL 4287 can be subject to TEST TYPE B if it is in LOCATION TYPE 14, even though LOCATION OF TEST does not show Test Type B/“Location type” 14 as a legitimate combination by itself. Alternatively, Test Type B/“Location type” 18 is legal in LOCATION OF TEST, even though there are no examples of it in MATERIAL TO BE TESTED IN A LOCATION.
In other words, Fifth Normal Form adds the (informally worded here) constraint:
-
A three-way (or more) relationship is redundant if all its occurrences may be derived from combinations of two-way occurrences.
Table 3.12. Fifth Normal Form—Case 2
Data Modeling and Normalization
Dr. Codd's techniques for defining data structures rigorously apply specifically to relations and attributes. They are expressed in terms of the relationships between attributes, and particularly in terms of dependencies between attributes.
Thus, they represent a conceptual model of data. Relational database management systems can implement relational structure as is, but they also permit “de-normalization” (violations of the normalization rules) for performance and other reasons. Also, different database management systems implement aspects of the relational model in different ways. Hence, once a particular database management system and a particular table design is involved, you are in the designer's world, not the architect's.
Because relational theory yields a conceptual view of data, it should obtain the same view of data as is yielded from data modeling. Because it is an inductive (bottom up) process, however, rather than a deductive (top down) one, the results can be different.
For example, in Table 3.3, reproduced here as Table 3.13, we conclude that “Material Description” is dependent upon “Material Code”, based on the example tuples shown. There is no guarantee, however, that another tuple will not appear—in data that we have not yet seen—that shows a Material Code of “3256” but a Material Description of “soil”. In that case, Material Description would not be dependent upon Material Code, and the reconfiguration for Second Normal Form would not be appropriate. (And, as mentioned above, we shouldn't automatically rule out the possibility that “Pool watter” is a different substance of interest.)
Yes, we can usually apply inductive reasoning to real data and come up with good results, but there is no guarantee.
Table 3.13. Induction about Data
Modeling of the conceptual data schema at Row Three of the Architecture Framework approaches the problem from a different direction. The thought process here is not “what can we infer from this pot of real data” but “what are the things that this entity type is about?” In Figure 3.20 we start with the idea of TEST, but simple examination of the attributes suggests that it is about more than that. Based on the meaning of the attributes, the model seems to be about sampling and materials, as well as tests. That would suggest the existence of entity types SAMPLE, MATERIAL, and TEST.
Figure 3.20. The TEST Entity Type.
Moreover, it would seem that there are both a reference to test types and actual occurrences of tests. That is, instead of simple TEST, what we really have seems to be TEST and TEST TYPE.
This examination, presented in entity/relationship model form, gives the picture in Figure 3.21. This did call for us to come up with relationship names, but that helps as well in our understanding of what is presented. The model, then, asserts that each SAMPLE must be of one MATERIAL, and that the SAMPLE may be tested in one or more TEST OCCURRENCES. Each TEST OCCURRENCE in turn must be on one SAMPLE and an example of one TEST TYPE. Note that we are now describing aspects of the business in English, not simply immersing ourselves in a lot of data.
Figure 3.21. Test Occurrences.
The question, then, is not whether, for example, “Material description” is dependent on “Material code”. Rather it should be “What thing in the world does material description (and material code, for that matter) describe?” Logically those turn out to be the same question, but the latter version gives a much clearer view of the nature of the data.
Tables 3.14 are derived from this model. Note that they are almost the same as the Third Normal Form example in Tables 3.5, reproduced below as Tables 3.15.
There are minor differences between these two relation designs: The TEST relation has become the TEST OCCURRENCE relation. This is not terribly important. It simply clarifies the distinction between the definition of a test (its TEST TYPE) and examples of it (its TEST OCCURRENCES).
A more significant difference is the addition of the SAMPLE relation. Clearly, examination of the model has shown us (as normalization did not) that samples are one subject of this model. Since each test is for one sample only, however, the sample has no non-key attributes. Moreover, since the key to TEST (OCCURRENCE) already includes “Sample date” and “Material code”, the normalization process did not identify any need for a separate relation. In a relational design, this omission may well be appropriate.
In the future, however, if it becomes desirable to have attributes for SAMPLE (such as “Sample location”, for example), it will be convenient to have already at least postulated the entity type, even if it was not implemented as a relation. If the only “model” of these data is the normalized relation structure, the sudden addition of a new attribute will require a complete revisiting of the normalization process.
Table 3.14. Test Occurrence Tables
Table 3.15. Normalized Test Tables
Data modeling is particularly helpful when it comes to understanding the higher orders of normalization.
Table 3.16 shows the example from Tables 3.9. Here we had two attributes that may be independently a function of the rest of the primary key—“Material Code” and “Test Type”. By creating two data models of the alternative solutions, it is easy to ask which is true.
Table 3.16. The Fourth Normal Form Problem
In the first example (Figure 3.22), the conductor and place of testing are determined to be independent. Each POTENTIAL TEST may be independently conducted by one or more CONDUCTORS and/or conducted at one or more LOCATION OF TESTS. By drawing these assertions in a data model, their shape is much more clear. Note that in this examination we realized that CONDUCTOR itself is a reference to a different entity type, PERSON, and LOCATION OF TEST is a reference to a SITE. While this is the model that represents Table 3.9's solution, these entity types were not evident in the normalization exercise.
Figure 3.22. The First Fourth Normal Form Solution.
In the second example, (Figure 3.23) each POTENTIAL TEST is a unique combination of PERSON who is the conductor of it and the SITE which is the location of it. This means that only combinations of MATERIAL, TEST TYPE, SITE, and PERSON that are explicitly specified for the POTENTIAL TEST are permitted. You cannot infer combinations from combinations of less than the whole set.
Figure 3.23. The Second Fourth Normal Form Solution.
The point is, you can see the implications of each solution much more clearly in the data model than you could in the tables of sample data.
Thus, while data modeling properly done provides a model in fifth normal form inherently, it can also provide additional insights that are not seen in the normalization process alone.
Object-Oriented Design
Object-oriented design does not have to follow the normalization guidelines. Object-oriented technology can readily handle multi-valued attributes, object classes “within” other object classes, and so forth. As such, object-oriented designers often implement the business owners' views directly, with all their complex, un-normalized views of data. Moreover, it can more directly accommodate the inheritance presented in the sub-types and super-types of an entity/relationship or object model.
A system so designed, however, is vulnerable to future changes in the business. An object-oriented designer can take a normalized architectural model and combine entity types in all kinds of ways that appear to match the business owners' views, but when those views change, it is very nice to have the model of the fundamental nature of the data to refer to. If you go directly from the business owners' views to design, you are likely to be in trouble if the business owners change their minds as to what they want.
The disciplines used to ensure that an architect's entity/relationship model is consistent with normalized principles are critical to understanding the true nature of the data. To bypass this step is to create a system that can be devastated by changes in the business environment. Once the architectural understanding has been obtained, then the flexibility object-oriented designers have in representing data can be very useful.
Referential Integrity
Relational theory includes an algebra for describing the combination of relations to arrive at clusters of data in any form. Specifically, a view is a relation that is assembled from others. The rules of normalization do not have to apply to a view. Hence the sample tables above that were not in fifth normal form could be constructed from the normalized tables for display to a user.
This is accomplished by joining tables together. The tables in Tables 3.4, for example could be joined together on the “Material Code” column to produce the “view” of the data that is Table 3.3.
Implicit in this join is the assertion that the “Material Code” column in each table in fact mean the same thing.
The idea behind relational theory is that all relationships are explicit, through the meaning of the columns, not implemented in some way behind the scenes. Everything you must know is contained in the data in the relations.
In the world, however, there is a category of business rules that are not evident in the data. Specifically, for a set of relations to be properly used, they must address the question of referential integrity. Tables 3.5 (page 96) show three relations, TEST, MATERIAL, and TEST TYPE. What is not evident from these tables is what would happen if you delete MATERIAL 3256, “pool water”. A business rule must be applied:
-
Cascade delete – deletion of parent also deletes all of that parent's children. For example, deleting an occurrence of MATERIAL 3256 automatically means deleting all tuples in TEST with “Material Code” 3256.
-
Restricted – deletion of parent is prohibited if that parent has children. For example, MATERIAL 3256 may not be deleted if there are any tuples in TEST with “Material Code” 3256
-
Nullify – deletion of parent causes the value of the join column for all children to be set to “null”. (Allowed only if that column is optional.) For example, deleting MATERIAL 3256 is permitted, leaving tuples in TEST pointing to a non-existent MATERIAL 3256 (or if the value of “Material Code” in TEST is simply set to “null”.)