- Item 18: Use Views to Simplify What Cannot Be Changed
- Item 19: Use ETL to Turn Nonrelational Data into Information
- Item 20: Create Summary Tables and Maintain Them
- Item 21: Use UNION Statements to "Unpivot" Non-normalized Data
Item 20: Create Summary Tables and Maintain Them
We mentioned previously (in Item 18, “Use views to simplify what cannot be changed”) that views can be used to simplify complex queries, and we even suggested that views can be used to provide summarizations. Depending on the volume of data, there are times when it may be more appropriate to create summary tables.
When you have a summary table, you can be sure that everything is in one place, making it easier to understand the data structure and quicker to return information.
One approach is to create a table that summarizes your data in your details table, and write triggers to update the summary table every time something changes in the details table. However, if your details table is frequently modified, this can be processor intensive.
Another approach is to use a stored procedure to refresh the summary table on a regular basis: delete all existing data rows and reinsert the summarized information.
DB2 has the concept of summary tables built into it. DB2 summary tables can maintain a summary of data in one or multiple tables. You have the option to have the summary refreshed every time the data in underlying table(s) changes, or you can refresh it manually. DB2 summary tables not only allow users to obtain results faster, but the optimizer can use the summary tables when user queries indirectly request information already summarized in the summary tables if ENABLE QUERY OPTIMIZATION is specified when you create the summary table. Although there may still be “costs” associated with all that activity, at least you did not have to write triggers or stored procedures to maintain the data for you.
Listing 3.6 shows how to create a summary table named SalesSummary that summarizes data from six different tables in DB2. Note that the SQL is not much different from that for creating a view. In fact, a summary table is a specific type of materialized query table, identified by the inclusion of a GROUP BY clause in the CREATE SQL. Note that we had to use Cartesian joins with filters, because of the restriction against using INNER JOIN in a materialized query table, and additionally provide COUNT(*) in the SELECT list to enable the use of the REFRESH IMMEDIATE clause. Those are necessary to permit the optimizer to use it.
Listing 3.6 Creating a summary table based on six tables (DB2)
CREATE SUMMARY TABLE SalesSummary AS ( SELECT t5.RegionName AS RegionName, t5.CountryCode AS CountryCode, t6.ProductTypeCode AS ProductTypeCode, t4.CurrentYear AS CurrentYear, t4.CurrentQuarter AS CurrentQuarter, t4.CurrentMonth AS CurrentMonth, COUNT(*) AS RowCount, SUM(t1.Sales) AS Sales, SUM(t1.Cost * t1.Quantity) AS Cost, SUM(t1.Quantity) AS Quantity, SUM(t1.GrossProfit) AS GrossProfit FROM Sales AS t1, Retailer AS t2, Product AS t3, datTime AS t4, Region AS t5, ProductType AS t6 WHERE t1.RetailerId = t2.RetailerId AND t1.ProductId = t3.ProductId AND t1.OrderDay = t4.DayKey AND t2.RetailerCountryCode = t5.CountryCode AND t3.ProductTypeId = t6.ProductTypeId GROUP BY t5.RegionName, t5.CountryCode, t6.ProductTypeCode, t4.CurrentYear, t4.CurrentQuarter, t4.CurrentMonth ) DATA INITIALLY DEFERRED REFRESH IMMEDIATE ENABLE QUERY OPTIMIZATION MAINTAINED BY SYSTEM NOT LOGGED INITIALLY;
Listing 3.7 on the next page shows how to provide a similar capability in Oracle through the use of a materialized view.
Listing 3.7 Creating a materialized view based on six tables (Oracle)
CREATE MATERIALIZED VIEW SalesSummary TABLESPACE TABLESPACE1 BUILD IMMEDIATE REFRESH FAST ON DEMAND AS SELECT SUM(t1.Sales) AS Sales, SUM(t1.Cost * t1.Quantity) AS Cost, SUM(t1.Quantity) AS Quantity, SUM(t1.GrossProfit) AS GrossProfit, t5.RegionName AS RegionName, t5.CountryCode AS CountryCode, t6.ProductTypeCode AS ProductTypeCode, t4.CurrentYear AS CurrentYear, t4.CurrentQuarter AS CurrentQuarter, t4.CurrentMonth AS CurrentMonth FROM Sales AS t1 INNER JOIN Retailer AS t2 ON t1.RetailerId = t2.RetailerId INNER JOIN Product AS t3 ON t1.ProductId = t3.ProductId INNER JOIN datTime AS t4 ON t1.OrderDay = t4.DayKey INNER JOIN Region AS t5 ON t2.RetailerCountryCode = t5.CountryCode INNER JOIN ProductType AS t6 ON t3.ProductTypeId = t6.ProductTypeId GROUP BY t5.RegionName, t5.CountryCode, t6.ProductTypeCode, t4.CurrentYear, t4.CurrentQuarter, t4.CurrentMonth;
Although SQL Server does not directly support materialized views, the fact that you can create indexes on views has a similar effect, and thus you can use indexed views in a similar manner.
Note that there can be some negative aspects to summary tables as well, such as the following:
Each summary table occupies storage.
The administrative work (triggers, constraints, stored procedures) may need to exist on both the original table and any summary tables.
You need to know in advance what users want to query in order to precompute the required aggregations and include them in the summary tables.
You may need multiple summary tables if you need different groupings or filters applied.
You may need to set up a schedule to manage the refresh of the summary tables.
You may need to manage the periodicity of the summary tables via SQL. For example, if the summary table is supposed to show the past 12 months, you need a way to remove data that is more than a year old from the table.
One possible suggestion to avoid some of the increased administrative costs of having redundant triggers, constraints, and stored procedures is to use what Ken Henderson referred to as inline summarization in his book The Guru’s Guide to Transact-SQL (Addison-Wesley, 2000). This involves adding aggregation columns to the existing table. You would use an INSERT INTO SQL statement to aggregate data and store those aggregations in the same table. Columns that are not part of the aggregated data would be set to a known value (such as NULL or some fixed date). An advantage of doing inline summarization is that the summary and the detail data can be easily queried together or separately. The summarized records are easily identified by the known values in certain columns, but other than that, they are indistinguishable from the detail records. However, this approach necessitates that all queries on the table containing both detail and summary data be written appropriately.
Things to Remember
Storing summarized data can help minimize the processing required for aggregation.
Using tables to store the summarized data allows you to index fields containing the aggregated data for more efficient queries on aggregates.
Summarization works best on tables that are more or less static. If the source tables change too often, the overhead of summarization may be too great.
Triggers can be used to perform summarization, but a stored procedure to rebuild the summary table is usually better.