Populating a Data Warehouse with SQL Server 2000
Building and maintaining data warehouses (DW) is one of the hottest topics in today's IT industry. Companies that have been collecting transactional data for decades have also been thinking about using such data effectively. Many businesses have been utilizing standard reports that have a "set-in-stone" structure. Others have their programmers run ad-hoc reports when needed. However, such reports no longer can provide strategic advantage to companies. What they need instead is a cost-effective set of tools to let them slice and dice their data along appropriate dimensions to get them the particular information that they need each time. Data warehouses happen to be such tools.
Even though data warehousing has been around for more than 20 years, the total cost of owning a DW has been unaffordable for small to midsize companies. The introduction of data warehousing support with SQL Server versions 7.0 and 2000 has made data warehouses much more affordable.
Many complex steps are involved in building and maintaining a successful data warehouse. First, you have to identify all the reporting needs of the organization. Next you need to figure where the data will come from and how to integrate data originating from various sources. Then you need to build a dimensional data model that will support the reporting needs and populate this data model with data.
This article talks about populating the data warehouses. Although most principles covered in this article apply to any database management system utilized for building a data warehouse, I will cover specific implementation details only for Microsoft SQL Server 2000.
50,000-Foot View
All data warehouses consist of measures and dimensions. The measures are individual facts to be represented on reports, whereas the dimensions are how the facts need to be broken down. For instance, a data warehouse for a grocery chain might have dimensions for customers, suppliers, stores, managers, and measures of costs and revenues.
When the dimensions and measures have been identified and the dimensional model has been built, it's time to move on to the physical implementation. This phase consists of several steps:
- Populating the fact and dimension tables with data
- Building appropriate indexes
- Creating Analysis Services cubes
- Building aggregations
- Creating programs to automate the process of updating your data warehouse
I've put step 5 at the end for a reason. Populating the tables and building cubes manually the first time gives you an idea of what it takes to refresh data in the fact and dimension tables, how long it takes to build aggregations, and which dimensions need to be continuously updated. Therefore, physical implementation is more likely to be successful with a bottom-up approach: First perform your steps manually, and then automate as many of the tasks as possible through your code.
Let's consider each step in detail.