Dimensional Databases: Building A Data Warehouse
- Dimensional Modeling
- Changing Dimensions
- Populating the Fact Tables
- Summary
Building a data warehouse involves several complicated steps (some of which I described in my previous articles for InformIT). After the data warehousing architect locates all data elements necessary to support the data warehouse, it is time to build a dimensional model. As data change in transactional systems, the data warehouse needs to have a way of tracking and reflecting such changes. Populating fact and dimension tables can take considerable amount of time if the population routines aren't designed carefully. This article addresses some of the challenges involved in building and maintaining dimensional databases that serve as the foundation of a data warehouse.
NOTE
This article focuses on SQL Server-specific implementations, but concepts discussed here can also apply to data warehouses built using other database management systems.
Dimensional Modeling
Most developers are very familiar with relational data modeling principles. Although you can still find some "brilliant" systems out there that keep all data in a single table, many IT professionals know the normalization rules and adhere to them as they should. Normalized tables work very well in transactional systems. However, when it comes to reporting, joining dozens of tables isn't very efficient. Thus, most organizations use separate databases for reporting and for transaction processing. Reporting systems allow some denormalization, but are still built using relational modeling concepts.
Dimensional modeling is somewhat different from its relational counterpart. No, I'm not referring to the height, width, and length when I talk about dimensional tables. All tables consist of rows and columns, so don't expect any magic here. The term dimensional modeling stems from the fact that dimensional databases are used to slice data along certain dimensions. For example, a cake factory dimensional database could have dimensions of product, ingredient, shift, employee, supervisor, time, and so forth. Such database could be used to generate reports that break down total production by each sort of cake, by date and time when cakes were baked, by shift that produced the goods, and so on.
Dimensional models consist of one or several fact tables and many dimension tables. For this reason they are sometimes referred to as star schemasone fact table surrounded by numerous dimension tables.
Fact tables record something tangible that can be reported. For example, a car dealership might record the fact that customer Jim White bought a 2002 Toyota Camry for $10,529 on February 20, 2005. Fact tables typically have foreign keys to all dimension tables; continuing with the dealership example, the sales fact table would have foreign keys to the customer dimension, the time dimension, and the vehicle dimension, among others.
Dimension tables are typically very simplethey contain the levels on which you want to group your reports. For example, the vehicle dimension could look similar to the following:
Vehicle_dimension Vehicle_key Vehicle_Identification_Number (VIN) Vehicle_type (car, van, truck, SUV, RV) Vehicle_class (luxury, sports, etc) Brand Make Model Year_manufactured
Such a design would allow breaking down total sales by the vehicle class, type, brand, make, model, or year in which the vehicle was made.
Time dimension, which is found in most data warehouses, usually has the following structure:
Time_dimension |
Time_key |
Year |
Quarter |
Month |
Day |
Day_of_the_year |
If you need a finer grain of detail on your reports, the time dimension can also include hours, minutes, and even seconds.
So how is a dimensional model different from a relational model? Well, fact or dimension tables aren't found in the relational database. Normalization rules command to keep each piece of data only in a single spot in the database. Hence, the vehicle dimension that you just examined would be represented by several normalized tables in the relational world, as shown next:
Table |
Vehicle |
Vehicle_type |
Brand |
Make |
Model |
Column |
Vehicle_key |
Vehicle_type_key |
Brand_key |
Make_key |
Model_key |
Column |
Vehicle_type_key |
Type_name |
Brand_name |
Brand_key |
Make_key |
Column |
Model_key |
|
|
Make_name |
Model_name |
Column |
Year_made |
A dimensional model doesn't celebrate each dimension level with a separate table. Rather, it allows redundant data within a dimension.
Some data warehouses are implemented using the "snowflake" schema, which is a special case of the star schema. Snowflake simply normalizes one or multiple dimensions, or each dimension might be made up of more than one table. For example, a human resources data warehouse might contain a job dimension, which could be made up of job title and job category tables. Job title might or might not be associated with the job category; therefore, the job dimension would look as follows:
Job_title |
Job_category |
Job_key |
job_category_key |
job_title_key |
Job_category |
Job_title |
Grade_level |
Job_description |
|
Job_category_key |
|
Most data warehouse implementations tend to begin with the star schema. If one of the dimensions grows large and could be more manageable by having multiple tables instead of one, the star schema evolves into the snowflake model.
Overall dimensional modeling isn't too complicatedmere mortals like you and me can get used to it with some practice. However, be sure to present the finished model to the end users at least twice: Changes to the model can be costly for a couple of reasons. If you have overlooked a dimension that users want to slice their data with, you have to do the following:
Find where data for this dimension are located.
Figure out how to extract this data.
Determine how to maintain changes to this dimension (see more on this in the next section).
Change fact table and DW population routines.