- 50,000-Foot View
- Populating the Fact and Dimension Tables
- Indexing Database Tables
- Building the Cube
- Defining the Storage Model
- Automating the Tasks
Populating the Fact and Dimension Tables
Depending on the nature of your data warehouse, you might have to populate your fact table(s) hourly, several times a day, nightly, once a week, once a month, or perhaps even once a year. It all depends on how volatile your data is. For instance, in the manufacturing industry, if your managers need to monitor the number of defects in products coming off the assembly line, your fact table might have to be refreshed hourly. On the other hand, if your marketing managers are comparing a store's sales during a particular time period with sales for the same period last year, a monthly refresh of the fact table will be sufficient. Indeed, it usually wouldn't make much sense to compare sales on Tuesday with those on Saturday of the same week.
How often you refresh your fact table depends on your business needs, so be sure to check with your users. If the organization already has some reports, they will give you a clue about how frequently managers need to examine their data. Keep in mind, though, that one of the reasons you're building a data warehouse is that the existing reports are not sufficient, so don't rely solely on what's already there. Ask what would make the managers' jobs easier and more productive.
Populating the dimension tables is much trickier than populating the fact tables. Some dimensions are relatively small and static. After you have created these, you almost never have to worry about them again. For example, consider a department dimension (sometimes referred to as an organizational unit dimension). Granted, a department name might change every year or so, or a new department might be added. But, in general, every organization will have Sales, Marketing, Finance, Operations, and a handful of other departments. If the Sales department is renamed to Marketing, all you have to do is change one record in the dimension table, and you're done. The exception to this rule is when managers want to see data under the Sales heading for the period when the department was called Sales and then see everything else under the heading of Marketing. If that's the case, you'll have to add columns to the dimension table that record the date range during which the department had a particular name. In addition, you might want to have separate keys for the Marketing and Sales members of the department dimension.
With other dimensions, you don't have such a luxury. For instance, consider the customer dimension. If you're building a data warehouse for a chain of retail stores, you might have thousands or even millions of customers. You'll have to update this dimension every time you rebuild the fact table, and you'll have to update the dimension table before populating the fact table. If a particular customer is missing from the dimension, your fact table cannot have a key pointing to that customer. This means that customer Gary Jones will have to be assigned a key (say, 12345) before you can write a record to the fact table representing Gary's purchase.
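The dimension-before-fact ordering can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (dim_customer, fact_sales), the source_id column, and the helper functions are hypothetical and do not come from any particular warehouse product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the fact-to-dimension link
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,   -- key the fact table points to
        source_id     TEXT UNIQUE NOT NULL,  -- hypothetical ID from the source system
        customer_name TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
        sale_date    TEXT NOT NULL,
        amount       REAL NOT NULL
    );
""")

def get_or_create_customer_key(conn, source_id, name):
    """Return the customer's dimension key, adding the member if it is new."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE source_id = ?", (source_id,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO dim_customer (source_id, customer_name) VALUES (?, ?)",
        (source_id, name),
    )
    return cur.lastrowid

def load_sale(conn, source_id, name, sale_date, amount):
    """Resolve the dimension key first, then write the fact row."""
    with conn:  # commit both rows together
        key = get_or_create_customer_key(conn, source_id, name)
        conn.execute(
            "INSERT INTO fact_sales VALUES (?, ?, ?)", (key, sale_date, amount)
        )
    return key
```

Calling load_sale twice for the same customer reuses the existing key, while an attempt to insert a fact row with a key that is not in dim_customer is rejected by the foreign key constraint.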
When you're working with frequently modified dimensions, you have to guarantee that the dimension is rebuilt before you rebuild the fact table. Therefore, you might want to put the whole rebuilding routine in a transaction. But I'm jumping a bit ahead of the game here; we'll talk about rebuilding routines later.
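As a rough sketch of the transaction idea, the snippet below refreshes a dimension and then a fact table as one unit of work, again using Python's sqlite3 module. The schema and the rebuild function are assumptions made for illustration; a real rebuilding routine would be considerably more involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
        amount       REAL NOT NULL
    );
""")

def rebuild(conn, customers, sales):
    """Load new dimension members, then fact rows, in a single transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        # Step 1: the dimension goes first so that every fact key can resolve.
        conn.executemany(
            "INSERT OR IGNORE INTO dim_customer (customer_key, customer_name) "
            "VALUES (?, ?)",
            customers,
        )
        # Step 2: the fact rows; a key missing from the dimension raises
        # sqlite3.IntegrityError and undoes step 1 as well.
        conn.executemany(
            "INSERT INTO fact_sales (customer_key, amount) VALUES (?, ?)",
            sales,
        )
```

If any fact row in a batch points to a customer that is not in the dimension, the whole batch is rolled back, including the dimension members added in the same run, so the two tables never end up half refreshed.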
You can draw an important conclusion from the previous few paragraphs: Dimensions can change in different ways. You can have additional dimension members, or some of the dimension member values can change over time. The former change is relatively easy to handle: Just add the new members to the dimension, and you're done. The latter change, on the other hand, can be handled in multiple ways. The concept of changing dimension member values is sometimes referred to as "slowly changing dimensions." We already discussed one way of dealing with changing dimension member values: adding a column to the dimension table that records the date when the member value changed and then adding a row to the table to assign a new key to the new dimension member. This is a tough way to resolve the problem because each member might change many times; for example, a female customer might have a maiden name, a married name, a divorced name, a name from a second marriage, and so on. Each time Ms. Jones changes her name, you need to add a new row and a pointer to the original record for Ms. Jones so that you know it is the same person.
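The add-a-row approach (often labeled a "Type 2" change in data warehousing literature) might look something like the following sketch. The column names (original_key, valid_from, valid_to) and the helper functions are hypothetical, and dates are stored as ISO-8601 text so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        original_key  INTEGER,        -- pointer to the person's first row
        customer_name TEXT NOT NULL,
        valid_from    TEXT NOT NULL,  -- date this name took effect
        valid_to      TEXT            -- NULL marks the current row
    )
""")

def add_customer(conn, name, as_of):
    """Insert a brand-new member; its original_key points at itself."""
    with conn:
        cur = conn.execute(
            "INSERT INTO dim_customer (customer_name, valid_from) VALUES (?, ?)",
            (name, as_of),
        )
        key = cur.lastrowid
        conn.execute(
            "UPDATE dim_customer SET original_key = ? WHERE customer_key = ?",
            (key, key),
        )
    return key

def rename_customer(conn, original_key, new_name, as_of):
    """Close the current row and add a new row with its own key."""
    with conn:
        conn.execute(
            "UPDATE dim_customer SET valid_to = ? "
            "WHERE original_key = ? AND valid_to IS NULL",
            (as_of, original_key),
        )
        cur = conn.execute(
            "INSERT INTO dim_customer (original_key, customer_name, valid_from) "
            "VALUES (?, ?, ?)",
            (original_key, new_name, as_of),
        )
    return cur.lastrowid

def name_on(conn, original_key, date):
    """Return the name the person used on a given date, or None."""
    row = conn.execute(
        "SELECT customer_name FROM dim_customer "
        "WHERE original_key = ? AND valid_from <= ? "
        "AND (valid_to IS NULL OR valid_to > ?)",
        (original_key, date, date),
    ).fetchone()
    return row[0] if row else None
```

Every rename costs one extra row, which is why this approach gets expensive for members that change often; in exchange, a fact row written under the old key keeps pointing at the name that was in effect at the time.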
An easier solution is to overwrite the existing value with the new value; if Ms. Jones decides to be Ms. Walters, just change her name and don't worry about any previous names used by Ms. Walters. Any purchases that Ms. Walters made while using different names will still appear on reports as Ms. Walters's. In many environments this would be acceptable; however, for certain projects (for instance, government-related work) you'll have to know exactly what the person's name was at the time of the transaction. Yet another way to handle slowly changing dimension values is to store aliases. You would record that Ms. Jones is the same person as Ms. Walters and Ms. Ravichandar, but you wouldn't change the "main" customer name, so the managers will always know whose purchasing behavior they're examining on each report.
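To make the contrast concrete, here is a small sketch of the overwrite and alias approaches side by side. The customer_alias table and the function names are assumptions introduced for the example; the two techniques are alternatives, so a real design would normally pick one per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL          -- the "main" name shown on reports
    );
    CREATE TABLE customer_alias (            -- hypothetical alias table
        customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
        alias_name   TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        customer_key INTEGER NOT NULL,
        amount       REAL NOT NULL
    );
""")

def overwrite_name(conn, key, new_name):
    """Overwrite approach: replace the name in place; history is lost."""
    with conn:
        conn.execute(
            "UPDATE dim_customer SET customer_name = ? WHERE customer_key = ?",
            (new_name, key),
        )

def add_alias(conn, key, alias):
    """Alias approach: keep the main name and record the other name beside it."""
    with conn:
        conn.execute("INSERT INTO customer_alias VALUES (?, ?)", (key, alias))

def sales_by_name(conn):
    """Total sales per main customer name, as a report would show them."""
    return conn.execute(
        "SELECT d.customer_name, SUM(f.amount) "
        "FROM fact_sales f JOIN dim_customer d USING (customer_key) "
        "GROUP BY d.customer_name ORDER BY d.customer_name"
    ).fetchall()

def aliases_for(conn, key):
    """All recorded aliases for one customer, in alphabetical order."""
    rows = conn.execute(
        "SELECT alias_name FROM customer_alias "
        "WHERE customer_key = ? ORDER BY alias_name",
        (key,),
    ).fetchall()
    return [r[0] for r in rows]
```

In both cases the fact table is untouched because the customer's key never changes. After an overwrite, every past purchase rolls up under the new name; with aliases, reports keep using the main name and the other names are available only for lookup.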
NOTE
Slowly changing dimensions are one of the most difficult data warehousing topics to learn and master. This introductory article just barely scratches the surface.