Indexing Database Tables
When you're done importing your data to the fact and dimension tables, it's time to think about indexing your database tables to improve performance. How long will it take to build the cubes, and how fast will the queries be executed against the fact and dimension tables? Ideally, most of the business questions should be answered by querying the cube using multidimensional expressions (MDX) and ActiveX Data Objects Multi-Dimensional (ADOMD). However, if any of your users are SQLsavvy, they'll opt for writing queries directly against the fact and dimension tables. Still, your main concern should be to optimize the cube-rebuilding tasks.
The fact table should have a clustered index defined on the combination of all dimension keys. Sometimes such a combination might qualify as a primary key for the fact table, but that won't always be the case. I'll discuss why and when a primary key would not be a combination of foreign keys in a following article. For now, just realize that because you'll be joining the fact table to the dimension tables, the clustered index should be built on all foreign key columns (keep in mind that you can have only one clustered index per table). In addition, it helps to define a nonclustered index on each individual foreign key in the fact table. Individual dimension tables, on the other hand, are not likely to be queried frequently. Therefore, indexing dimension tables is relatively straightforward: You can place a clustered index on a primary key or another column that might be queried frequently. A nonclustered index on a primary key of a dimension table will do just fine.
By now, you might be asking how you are supposed to load the data and maintain the indexes that I recommended. If so, give yourself a credityou know that indexes can be a blessing in disguise when it comes to INSERT, UPDATE, and DELETE statements. The issue here is that SQL Server has to maintain the index keys as data modifications occur. Therefore, INSERT, UDPATE, and DELETE statements will generally run more slowly on tables with indexes. Before you rush off to test this theory, let me warn you: SQL Server 2000 is an awesome product that manages indexes much more efficiently than any of its predecessors. Therefore, if you have 10,000 rows in your fact table, indexes will not slow your performance drastically. But when you get up to several million rows, tiny index maintenance taxes will add up quickly.
Some experts suggest populating the fact tables through BULK copy operations. SQL Server offers a command-line utility affectionately referred to as BCP (for the Bulk Copy Program), as well as a Transact-SQL BULK INSERT command to handle such tasks. I'll save the discussion of BCP and BULK INSERT for another article. Personally, I do not recommend writing data out to the file system and then copying it back to SQL Server format (which is what you'd have to do if your transactional data is already in SQL Server). In my experience a better approach has been to do the following:
- Drop the indexes on the fact table.
- Insert the new records into the fact table.
- Truncate the transaction log when you're finished populating the fact table.
- Rebuild indexes on the fact table.
I know that this sounds like a lot of work, but if your performance matters, this investment of effort will pay substantial dividends.
This discussion means that you have to drop and re-create indexes on the fact table each time you populate it with new data. The indexes on the dimension tables are usually not so troublesome, unless you have thousands of new customers to insert into the Customer dimension each time you populate your data warehouse. Most of the time, rebuilding indexes on the dimension tables once for every 5 to 7 populations of the fact table will provide acceptable performance.