Recommended Practices with Partitions and Aggregations
The following sections offer recommendations for working with partitions and aggregations.
#1: Abide by Prescribed Limits for Partition Sizes
Microsoft recommends limiting each partition to 20 million rows or to a file size of 1GB. Some applications attempt to use much larger partitions, perhaps because data is partitioned only by month or by day. Keep in mind that during processing, MSAS has to read an entire partition’s data on a single thread. It is often more efficient to process five partitions of 20 million rows each in parallel than to process a single partition with 100 million rows. Furthermore, if you partition your data according to typical query patterns, you will see far better query performance than if your measure group had a single large partition.
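To make the arithmetic concrete, here is a minimal Python sketch that estimates how many partitions a measure group needs in order to stay within the row guideline above. The function name and constant are my own, chosen for illustration; they are not part of any MSAS tooling.

```python
# Row guideline quoted in the text above (20 million rows per partition).
MAX_ROWS_PER_PARTITION = 20_000_000


def partitions_needed(total_rows: int, max_rows: int = MAX_ROWS_PER_PARTITION) -> int:
    """Return the smallest number of partitions that keeps each one at or under max_rows."""
    if total_rows <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding surprises.
    return (total_rows + max_rows - 1) // max_rows


if __name__ == "__main__":
    # The 100-million-row example from the text splits into five partitions.
    print(partitions_needed(100_000_000))
```

This only gives a lower bound. In practice you would pick partition boundaries that follow query patterns, which can produce more (and unevenly sized) partitions.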
For example, let’s suppose you are deciding whether to partition your measure group by year and by product category. Suppose we have data for five years and for three categories (bikes, accessories and clothing). If we store all data in a single 15GB partition, every query will have to examine this 15GB data file (presuming the data is not found in an MSAS storage engine cache and no useful aggregations exist for resolving the query). Now let’s split the data into 15 partitions of 1GB each, one for each combination of year and category. A query examining bike sales for 2009 will only have to read a single 1GB file. Scanning 15GB of data will invariably be slower than scanning a 1GB file.
Many people feel that they will end up with too many partitions if they partition data on any dimension other than the time or date dimension. This simply is not true. Theoretically, there is a limit to the number of partitions per cube: 2 billion. Most cubes will have far fewer partitions, however. So go ahead and partition by multiple hierarchies when possible to match the pattern of data retrieval.
You have a couple of options for populating measure groups partitioned by multiple hierarchies. You could define a separate view in the relational data source for each partition, with each view retrieving only a portion of the fact table’s data. Alternatively, you could bind each partition’s definition to a different query. Personally, I prefer the second option, particularly for environments where I don’t have direct access to make schema changes in the relational source.
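As a rough sketch of the query-binding option, the Python snippet below generates one source query per year and category combination. Everything named here is an assumption made for illustration: the fact table `FactSales`, the columns `OrderYear` and `ProductCategory`, the year range, and the partition naming scheme. Substitute whatever your relational source actually uses.

```python
from itertools import product

# Hypothetical values: five years and the three categories from the example above.
YEARS = [2005, 2006, 2007, 2008, 2009]
CATEGORIES = ["bikes", "accessories", "clothing"]


def partition_queries(years, categories, fact_table="FactSales"):
    """Map a partition name to the source query that would populate it."""
    queries = {}
    for year, category in product(years, categories):
        name = f"{category}_sales_{year}"
        # Table and column names are placeholders, not a real schema.
        queries[name] = (
            f"SELECT * FROM {fact_table} "
            f"WHERE OrderYear = {year} AND ProductCategory = '{category}'"
        )
    return queries


if __name__ == "__main__":
    for name, sql in partition_queries(YEARS, CATEGORIES).items():
        print(f"{name}: {sql}")
```

Whichever option you choose, make sure the filters do not overlap and that together they cover the whole fact table; otherwise rows are counted twice or dropped from the cube.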
#2: Define the Slice Property for Every Partition
Much like dimensions, each partition has several properties that should be carefully examined and configured appropriately. Although some literature advises that setting the partition slice property is unnecessary for MOLAP partitions, do yourself a favor and set this property for every partition. At query time, MSAS checks partition XML files (these files are called info.version_number.xml) for internal data IDs identifying data ranges for each dimension attribute. At times, if the partition slice isn’t defined, you will notice that MSAS reads more partitions than necessary to resolve a query. For example, instead of only reading the bike_sales_2004 partition, Analysis Services may also read the clothing_sales_2005 partition if the slice property isn’t set, even if the query only requested data for bike sales in 2004. Reading a single partition will be faster than reading multiple partitions.
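The slice is an MDX tuple that describes the data a partition holds. The toy Python model below shows the kind of tuple you might set for each partition and why an explicit slice lets a query skip unrelated partitions. It is not how MSAS works internally. The hierarchy names `[Date].[Calendar Year]` and `[Product].[Category]`, and the member keys, are hypothetical; use the names and keys from your own dimensions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    year: int
    category: str

    def slice_expression(self) -> str:
        """Build an MDX tuple for the slice property (hierarchy names are assumed)."""
        return (
            f"([Date].[Calendar Year].&[{self.year}], "
            f"[Product].[Category].&[{self.category}])"
        )


def partitions_to_scan(partitions, year, category):
    """Toy model of partition elimination: keep only partitions whose slice matches."""
    return [p.name for p in partitions if p.year == year and p.category == category]


if __name__ == "__main__":
    parts = [
        Partition("bike_sales_2004", 2004, "Bikes"),
        Partition("clothing_sales_2005", 2005, "Clothing"),
    ]
    for p in parts:
        print(p.name, "->", p.slice_expression())
    print(partitions_to_scan(parts, 2004, "Bikes"))
```

With a slice defined on both partitions, a query for bike sales in 2004 touches only bike_sales_2004 in this model.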
#3: Use the Aggregation Manager Sample Tool for Designing Custom Aggregations
Microsoft re-engineered the Usage-Based Optimization (UBO) Wizard with Analysis Services 2008 because with version 2005, it wasn’t always effective; at times, the wizard would not create useful aggregations even if you chose a 100 percent performance improvement goal. Business Intelligence Development Studio (BIDS) 2008 also offers the ability to pick and choose which attributes should be included in a specific aggregation through the advanced view within the Aggregations tab. SQL Server Management Studio (SSMS) allows scripting aggregation designs. However, neither BIDS nor SSMS wizards allow crafting aggregations for specific queries.
If you find that UBO does not meet your needs, then download the Aggregation Manager sample tool. The tool is easy to use and works with both the 2005 and 2008 versions. First, clear the query log; next, execute the query workload for which you would like to tune performance; and then build aggregations based on the query log. You may want to use the “eliminate redundancy” and “remove duplicates” options within Aggregation Manager so that you don’t end up with too many aggregations.
You can have multiple aggregation designs per measure group. For example, you could have one aggregation design with many aggregations for frequently accessed partitions; another aggregation design containing only a few aggregations could be applied to rarely queried historical partitions.
#4: Use Separate Measure Groups for Distinct Count Measures
This recommendation is well documented but not always followed. BIDS automatically assigns a measure with the distinct count aggregation function to its own separate measure group. However, if you create a measure with sum, count or another aggregation function and later change its aggregation function to distinct count, BIDS leaves the measure in its existing measure group and so allows you to shoot yourself in the foot. Fortunately, BIDS 2008 does warn developers of such mistakes.
#5: Specify the Maximum Degree of Parallelism for Processing Objects on Multi-Processor Servers
By default, MSAS decides the appropriate degree of parallelism for processing operations. However, on multi-processor hosts you may find that the software sometimes attempts to do more than it can handle. Processing too many partitions in parallel may also place an unbearable load on the relational database server. Fortunately, you can override the default and specify the degree of parallelism for processing operations through XMLA commands.
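In XMLA, the degree of parallelism is capped with the MaxParallel attribute on the Parallel element of a Batch command. As a minimal sketch, the Python snippet below builds such a command for a list of partitions. The database, cube, measure group and partition IDs are hypothetical placeholders, and ProcessFull is just one possible processing type; adjust both to your environment before running the resulting XMLA.

```python
import xml.etree.ElementTree as ET

# Namespace used by Analysis Services XMLA commands.
XMLA_NS = "http://schemas.microsoft.com/analysisservices/2003/engine"


def build_process_batch(database_id, cube_id, measure_group_id, partition_ids, max_parallel):
    """Return an XMLA Batch that processes the partitions, at most max_parallel at a time."""
    ET.register_namespace("", XMLA_NS)  # serialize with a default namespace
    batch = ET.Element(f"{{{XMLA_NS}}}Batch")
    parallel = ET.SubElement(
        batch, f"{{{XMLA_NS}}}Parallel", {"MaxParallel": str(max_parallel)}
    )
    for partition_id in partition_ids:
        process = ET.SubElement(parallel, f"{{{XMLA_NS}}}Process")
        obj = ET.SubElement(process, f"{{{XMLA_NS}}}Object")
        # All object IDs passed in are placeholders supplied by the caller.
        for tag, value in (
            ("DatabaseID", database_id),
            ("CubeID", cube_id),
            ("MeasureGroupID", measure_group_id),
            ("PartitionID", partition_id),
        ):
            ET.SubElement(obj, f"{{{XMLA_NS}}}{tag}").text = value
        ET.SubElement(process, f"{{{XMLA_NS}}}Type").text = "ProcessFull"
    return ET.tostring(batch, encoding="unicode")


if __name__ == "__main__":
    print(build_process_batch(
        "SalesDB", "SalesCube", "FactSales",
        ["bike_sales_2004", "clothing_sales_2005", "bike_sales_2005"],
        max_parallel=2,
    ))
```

The generated command can be pasted into an XMLA query window in SSMS. Start with a low MaxParallel value and raise it while watching the load on both the MSAS host and the relational server.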