- Dimensional Modeling
- Changing Dimensions
- Populating the Fact Tables
- Summary
Changing Dimensions
Perhaps one of the biggest challenges in maintaining a data warehouse is managing historical data values. For example, suppose that you have an employee table that tracks organizational hierarchy by recording the employee ID of the supervisor (as Supervisor_ID):
Employee |
Employee_ID |
Last_name |
First_name |
... (other columns) |
Supervisor_ID |
What happens when the organizational structure changes? Suppose that Jim White worked for Jason Green from 1999 to 2002, and Jason Green found greener pastures and left the company in 2002. James Bond became the new supervisor for Jim White. In a transactional system, this change is likely to recorded as simply overwriting the supervisor IDchanging it from Jason Green's employee ID to that of James Bond. This way, all new reports show James as Jim's supervisor. But what about the historical reports? Would James Bond like to get credit for Jason Green's work? Depends on whether Jason had done a good job or not, right?
There are several ways of managing changing dimensions, and each of them has merit in certain situations depending on business needs. For example, if FINANCE department changes its name to FINANCES, you are safe in presuming that overwriting the existing value with the new value won't hurt many feelings. So the easiest way to handle the change is to discard the historical value and replace it with the new value.
The next idea is to record the old value, the new value, and the date of the change. To support such functionality, the employee dimension would have to change slightly:
Employee |
Sequence_number |
Employee_ID |
Last_name |
First_name |
... (other columns) |
Supervisor_ID |
Supervisor_Start_date |
Supervisor_End_date |
This way, the reports would have to check the dates for which the data is retrieved and report either Jason Green or James Bond as Jim White's supervisor. Notice that in such cases employee_id can no longer be used to uniquely identify recordseach employee that has had more than one supervisor will have more than one record in the employee dimension. Therefore, I added the sequence_number column to uniquely identify each record. Doing so allows me to run reports by employee (regardless of who managed the employee at the time the report was generated) or by supervisor.
Yet another way to treat the changing dimension is to create a new record when certain values change and do not associate the old record with the new one. Suppose that we're building a data warehouse for a video store chain. And suppose that renting habits of single women are much different from those of married women. When one of our customers, Lacey Smith, gets married and changes her name to Lacey Smith-Butler, we no longer care to associate her rental history of a single woman to her new trends as a married woman. So we keep her old record, but we also create a new record in which Lacey's marital status appears as married:
Customer_ID |
Last_name |
First_name |
Marital_status |
Sex |
Record_creation_date |
12345 |
Smith |
Lacey |
Single |
Female |
06-20-1999 |
349500 |
Smith-Butler |
Lacey |
Married |
Female |
06-20-2003 |
We examined various ways in which we can approach the maintenance of changing dimensions. This seems fairly straightforward if we were to examine each dimension record manually. However, unless you're building a data warehouse for a simple classroom exercise, manual examination of each record is not an option. Some advanced (and expensive) data warehousing tools allow automatic maintenance of changing dimensions. But in most cases, changing dimensions must be maintained by using the routines that import data from the transactional sources into the data warehouse.
Code in the routines for populating changing dimensions will differ depending on which method you use for maintaining the change; if you are replacing the existing value with the new value, you can simply empty out the existing dimension table (by using the TRUNCATE TABLE statement) and populate it with new values.
If you are maintaining the history of when changes occurred, you have to compare values of each column for which you're tracking history and populate the start_date and end_date columns accordingly. In addition, you have to create a new record and copy values of columns that are the same in the old and new record.
Finally, you might be creating a new record without associating it with the old record if values of a particular column change. If so, you need to examine the contents of the column in question and add a new record to the dimension.