Now that we have much of the relevant terminology out of the way, we can look at what normalization means in relevant terms. The example here isn't the typical employee-manager-department example or the alternative student-professor-course offering example. I will look at a hypothetical insurance company's database. The tables are somewhat more complex than the ones usually used in examples, but are close to what many people find in real life.
Figure 1 shows the denormalized definition of the claim table. Although there will likely be far more tables in an insurance company's database (including but certainly not limited to client, adjuster, and cause), these tables can give us a background through which to look at normalization and its ramifications. Keep in mind that the examples under each of the sections will be only part of the columns, to simplify the example and to allow you to see easily what is being changed.
CLAIM_NUM OCCURANCE_NUM CLAIM_STATUS ACCDNT_YR ACCDNT_DT REPORTED_DT ENTERED_DT CLAIM_DT1 CLAIM_DT2 CLAIM_DT3 CLAIM_DT4 CLAIM_DT4 CLAIM_DT5 CLAIM_DT6 CLAIM_DT7 CLAIM_DT8 CLAIM_DT9 CLAIM_DT10 CLOSED_DT DEATH_DT ASSIGNED_DT ADJSTER_CD ADJUSTER_NAME AGENT_CD AWARD_CD CAUSE_CD CAUSE_DESC LOCATION SITE COVERAGE_CD COVERAGE_DESC DED_RECOV DEDUCTIBLE_REMAIN PAID_1 RESERVED_1 PAID_2 RESERVED_2 PAID_3 RESERVED_3 PAID_4 RESERVED_4 PAID_5 RESERVED_5 PAID_6 RESERVED_6 PAID_7 RESERVED_7 PAID_8 RESERVED_8 PAID_9 RESERVED_9 PAID_10 RESERVED_10 LEGAL_FLG KEY1 KEY2 KEY3 KEY4 KEY5 KEY6 KEY7 KEY8 KEY9 KEY10 SEVERITY_CD POLICY_NUM PAYMENT_NUM SSN STATE ACTVY_DT ENTRY_DT ADMIN_CD ADMIN_DESC REOPEN_DT INSURED_NAME INSURED_ADDRESS INSURED_PHONE INSURED_CITY INSURED_STATE INSURED_ZIP CLAIMANT_NAME CLAIMANT_ADDRESS CLAIMANT_CITY CLAIMANT_STATE CLAIMANT_ZIP CLAIMANT_PHONE SPECIAL_DT_1 SPECIAL_DT_2 SPECIAL_DT_3 SPECIAL_DT_4 SPECIAL_DT_5 SPECIAL_DT_6 SPECIAL_DT_7 SPECIAL_DT_8 SPECIAL_DT_9 SPECIAL_DT_10 GROSS_PD POLICY_ID
Figure 1 Columns in the denormalized example Claim table
First Normal Form (1NF)
Converting a database, or (more simply and attainable if you have inherited a denormalized database) a database table to the first normal form is usually fairly simple. The first normal form rule calls for the elimination of repeating groups of data, which is accomplished through the creation of separate tables of related data. This makes the table decisions by observing the data as much as just the table structure to accomplish first normal form.
First normal form eliminates repeating groups by putting each into a separate table and connecting them with a one-to-many relationship.
No repeating attributes and no repeating sets of valuesthis sounds simple enough. It is sometimes difficult, however, to convince people that simply adding another set of whatever it is to the design because you are out of other options, and that it really "goes" with what you are doing, anyway.
If we want to put the claim table into first normal form, we have to find all of the attributes that are really connected just to a claim. What makes a claim, a claim?
A claim has a number.
A claim has a claimant.
A claim has a reported date.
A claim has an accident or illness date.
A claim has a reserved amount for the things that might be caused by the accident or illness.
A claim belongs to or is written against a policy.
A claim can be closed
A claim can be reopened.
Does a claim have a kind of coverage, or is that more something that a policy has?
Does a claim have a cause, or does the accident or illness have a cause?
Do you pay a claim, or do you pay an invoice?
Does a claim have a Social Security number? Or is a Social Security number something that belongs to a claimant?
Death date is an interesting call. Does a claim die? No, but if it is life insurance, it might be relevant to the claim...so maybe it should stay.
A revised set of columns that is directly related to what is a claim iswhat makes a claim a claimfollows in Figure 2:
CLAIM_NUM CLAIM_STATUS ACCIDENT_YR ACCIDENT_DT REPORTED_DT ENTERED_DT CLOSED_DT DEATH_DT ASSIGNED_DT ADJSTER_CD ADJUSTER_NAME AGENT_CD AGENT_NAME AWARD_CD AWARD_DESC PAYMENT_NUM LOCATION SITE DEDUCTIBLE_RECOVER DEDUCTIBLE_REMAIN POLICY_NO POLICY_DESCRIPTION STATE RUN_DT ACTIVITY_DT ENTRY_DT REOPEN_DT INSURED_NAME INSURED_ADDRESS INSURED_PHONE INSURED_CITY INSURED_STATE INSURED_ZIP CLAIMANT_NAME CLAIMANT_ADDRESS CLAIMANT_CITY CLAIMANT_STATE CLAIMANT_ZIP CLAIMANT_PHONE GROSS_PD
Figure 2 Claim table in First Normal Form
A revised version of the claim table put into first normal form would include information that has to do only with a claim and not a payment or an invoice, not a policy or an accident.
Payment_num |
Claim_status |
Accident_dt |
Accident_yr |
Reported_dt |
Entered_dt |
123456789 |
Open |
20-JUN-2000 |
2000 |
28-JUN-2000 |
29-JUN-2000 |
234567890 |
Reviewed |
15-FEB-1984 |
1984 |
19-FEB-1984 |
20-FEB-1984 |
147258369 |
Reopened |
08-APR-2003 |
2003 |
10-APR-2003 |
11-APR-2003 |
258369147 |
Closed |
18-DEC-1980 |
1980 |
18-DEC-1980 |
19-DEC-1980 |
If you have a payment table and you store the amount that you have reserved on a particular claim to allocate to paying different kids of bills, why not just store them in the payment table? You are storing this information on the payment table anyway, so why not leave it there and not on the claim?
If the only reason to put it on the claim is that a user might want that information at the claim level, the claim and payment tables can be joined, and the information can be derived by summing all the payments that occur for a single claim. And because you have different kinds of insurance policies (and therefore different kinds of claims), why not just store all payments for all kinds of claims in a single table? It's logical to store all payments in the same table. Most of the information that is associated to a payment (the attribute) is the same, regardless of what kind of payment or what kind of claim it is. The accounting information for the different kinds of claims differs somewhat, however.
A health insurance claim would probably not have automobile damage reserves (the money set aside for a claim to draw upon for the purpose of paying bills). A new mother's claim for health insurance for the delivery of a baby typically would not have monies set aside for death benefits. But they all have need of a kind of reserve and they all draw against those reserves to pay benefits.
Second Normal Form (2NF)
Second normal form deals with the elimination of redundant data. Second normal form is often violated when information within the table is dependent on other columns in the table that are not part of the key.
If the new first normal form claim table's columns look like this, the redundant data that can be quickly and easily addressed is the insured city and state as well as claimant city and state. City and state are directly dependent on the Zip code of a place, not on anything that is connected to the claim.
CLAIM_NUM CLAIM_STATUS ACCIDENT_YR ACCIDENT_DT REPORTED_DT ENTERED_DT CLOSED_DT DEATH_DT ASSIGNED_DT ADJSTER_CD ADJUSTER_NAME AGENT_CD AGENT_NAME AWARD_CD AWARD_DESC LOCATION SITE DEDUCTIBLE_RECOVER DEDUCTIBLE_REMAIN POLICY_NO POLICY_DESCRIPTION STATE RUN_DT ACTIVITY_DT ENTRY_DT REOPEN_DT INSURED_NAME INSURED_ADDRESS INSURED_PHONE INSURED_CITY INSURED_STATE INSURED_ZIP CLAIMANT_NAME CLAIMANT_ADDRESS CLAIMANT_CITY CLAIMANT_STATE CLAIMANT_ZIP
Figure 3 Claim Table in Second Normal Form
Claim_num |
Claimant_name |
Claimant_address |
Claimant_city |
Claimant_state |
Claimant_zip |
123456789 |
Jennifer Smith |
1234 Main |
Pittsburgh |
PA |
15201 |
234567890 |
Bill Smith |
7852 Eagle |
Pittsburgh |
PA |
15202 |
147258369 |
John Jones |
4562 Edge |
Eighty Four |
PA |
15330 |
258369147 |
Eleanor Stillwater |
7531 West Eastern |
Somerset |
PA |
15510 |
Zip_Code |
City |
State |
15330 |
Eighty Four |
PA |
15510 |
Somerset |
PA |
15201 |
Pittsburgh |
PA |
15202 |
Pittsburgh |
PA |
15203 |
Pittsburgh |
PA |
15204 |
Pittsburgh |
PA |
15205 |
Pittsburgh |
PA |
15206 |
Pittsburgh |
PA |
15207 |
Pittsburgh |
PA |
15208 |
Pittsburgh |
PA |
15209 |
Pittsburgh |
PA |
15210 |
Pittsburgh |
PA |
Because Pittsburgh, Eighty Four, and Somerset, PA were dependent not on the claim, but on the Zip code that is attached to that information, it does not directly belong in the payment table. Although this is not the whole problem with this table, it does eliminate the difficulty that comes with the city, state, Zip code dependency.
Claim_num |
Claimant_name |
Claimant_address |
Claimant_zip |
123456789 |
Jennifer Smith |
1234 Main |
15201 |
234567890 |
Bill Smith |
7852 Eagle |
15202 |
147258369 |
John Jones |
4562 Edge |
15330 |
258369147 |
Eleanor Stillwater |
7531 West Eastern |
15510 |
Other information that could be migrated to other tables to further migrate the claim table to second normal form is to take out the award code award description combination and store only award code in the claim table. That way, if there is any update on the description of any given code that needs to be changed, it could be changed in one row in one column in the award table, and there wouldn't be the chance of update anomaly that might happen if you were to update one column in a tableaffecting hundreds or thousands of entries. The same logic dictates that similar changes be made to adjuster and agent, migrating their information into their own tables and storing only the code column value in the claim table, thus facilitating the join that enables access of the auxiliary information.
award_cd |
award_desc |
adjuster_cd |
adjuster_name |
agent_cd |
agent_name |
Third Normal Form (3NF)
The third normal form rule seeks to eliminate all the attributes from a table that are not directly dependent on the primary key in combination with the table also being in first normal form and second normal form. For all the pieces that are not directly related to the primary key of the table, a new table is created. Each new table houses the pieces from the original table with the key on which they are directly dependent.
NOTE
It is often said that third normal form basically says, "The key, the whole key, and nothing but the key."
CLAIM_Num CLAIM_STATUS ACCIDENT_DT REPORTED_DT ENTERED_DT CLOSED_DT DEATH_DT ASSIGNED_DT ADJSTER_CD AGENT_CD AWARD_CD LOCATION SITE DEDUCTIBLE_RECOVER DEDUCTIBLE_REMAIN POLICY_NO STATE RUN_DT ACTIVITY_DT ENTRY_DT REOPEN_DT INSURED_NAME INSURED_ADDRESS INSURED_PHONE INSURED_ZIP CLAIMANT_NAME CLAIMANT_ADDRESS CLAIMANT_ZIP
Figure 4 Claim Table in Third Normal Form
More on third normal form transformation can be seen in the claim table, in which the insured name, address, phone number, and Zip code are more dependent on the policy that was written than on the claim itself. We can, therefore, move the insured's information to a policy table. This leaves the claim table with information that is most directly related to a claim and leaves all the other information (the policy; award; adjuster; agent; and city, state, Zip code information) into its own table structures with enough information to allow for no loss of information. A simple join of all these tables would result in being able to reconstruct the original table's information, and that is what relational algebra and relational calculus (the basis on which relational theory and relational database are based) aim to do.
POLICY_NO |
INSURED_NAME |
INSURED_ADDRESS |
INSURED_PHONE |
INSURED_ZIP |
Third normal form is often as far as people take normalization, and it is usually as far as is practical in normalizing and standardizing data. There are more normal forms, however. The higher and higher the number, the more difficult it is to accomplish in simple steps and the more and more it approaches theory only.
Fourth Normal Form
In fourth normal form, relations cannot have non-atomic or multi-valued attributes.
Fifth Normal Form
Fifth normal form eliminates anomalies that result from join dependencies and breaks out data redundancy that is not covered by any of the previous normal forms.