- Item 18: Use Views to Simplify What Cannot Be Changed
- Item 19: Use ETL to Turn Nonrelational Data into Information
- Item 20: Create Summary Tables and Maintain Them
- Item 21: Use UNION Statements to "Unpivot" Non-normalized Data
Item 21: Use UNION Statements to “Unpivot” Non-normalized Data
You saw in Item 3, “Get rid of repeating groups,” how UNION queries can be used to deal with repeating groups. We explore UNION queries a little bit more in this item. As you will learn in Item 22, “Understand relational algebra and how it is implemented in SQL,” the Union operation is one of the eight relational algebra operations that can be performed within the relational model defined by Dr. Edgar F. Codd. It is used to merge data sets created by two (or more) SELECT statements.
Assume that the only way you are able to get some data for analysis is in the form of the Excel spreadsheet pictured in Figure 3.5, which is obviously not normalized.
Figure 3.5 Non-normalized data from Excel
Assuming you can import that data into your DBMS, at best you will end up with a table (SalesSummary) that has five pairs of repeating groups, which we will call OctQuantity, OctSales, NovQuantity, NovSales, and so on to FebQuantity and FebSales.
Listing 3.8 shows a query that would let you look at the October data.
Listing 3.8 SQL to extract October data
SELECT Category, OctQuantity, OctSales FROM SalesSummary;
Of course, to look at the data for a different month, you need a different query. And let’s not forget that data that is not normalized can be more difficult to use for analysis purposes. This is where a UNION query can help.
There are three basic rules that apply when using UNION queries:
There must be the same number of columns in each of the queries making up the UNION query.
The order of the columns in each of the queries making up the UNION query must be the same.
The data types of the columns in each of the queries must be compatible.
Note that there is nothing in those rules about the names of the columns in the queries that make up the UNION query.
Listing 3.9 shows how to combine all of the data into a normalized view.
Listing 3.9 Using UNION to normalize the data
SELECT Category, OctQuantity, OctSales FROM SalesSummary UNION SELECT Category, NovQuantity, NovSales FROM SalesSummary UNION SELECT Category, DecQuantity, DecSales FROM SalesSummary UNION SELECT Category, JanQuantity, JanSales FROM SalesSummary UNION SELECT Category, FebQuantity, FebSales FROM SalesSummary;
Table 3.2 shows a partial extract of the data returned.
Table 3.2 Partial extract of data returned by the UNION query in Listing 3.9
Category |
OctQuantity |
OctSales |
Accessories |
923 |
60883.03 |
Accessories |
930 |
61165.40 |
. . . |
. . . |
. . . |
Bikes |
450 |
585130.50 |
Bikes |
542 |
705733.50 |
Car racks |
96 |
16772.05 |
Car racks |
115 |
20137.05 |
Car racks |
124 |
21763.30 |
. . . |
. . . |
. . . |
Skateboards |
203 |
89040.58 |
Skateboards |
204 |
79461.30 |
Tires |
110 |
3081.24 |
Tires |
137 |
3937.70 |
Tires |
150 |
4388.55 |
Tires |
151 |
4356.91 |
Tires |
186 |
5377.60 |
Two things should stand out. First, there is no way to distinguish to which month the data applies. The first two rows, for instance, represent the quantity and sales amount for Accessories for October and November, but there is no way to tell that from the data. As well, despite the fact that the data represents five months of sales, the columns are named OctQuantity and OctSales. That is because UNION queries get their column names from the names of the columns in the first SELECT statement.
Listing 3.10 shows a query that remedies both of those issues.
Listing 3.10 Tidying up the UNION query used to normalize the data
SELECT Category, 'Oct' AS SalesMonth, OctQuantity AS Quantity, OctSales AS SalesAmt FROM SalesSummary UNION SELECT Category, 'Nov', NovQuantity, NovSales FROM SalesSummary UNION SELECT Category, 'Dec', DecQuantity, DecSales FROM SalesSummary UNION SELECT Category, 'Jan', JanQuantity, JanSales FROM SalesSummary UNION SELECT Category, 'Feb', FebQuantity, FebSales FROM SalesSummary;
Table 3.3 shows the same partial extract returned by the query in Listing 3.10.
Table 3.3 Partial extract of data returned by the UNION query in Listing 3.10
Category |
SalesMonth |
Quantity |
SalesAmount |
Accessories |
Dec |
987 |
62758.14 |
Accessories |
Feb |
979 |
60242.47 |
. . . |
. . . |
. . . |
. . . |
Bikes |
Nov |
412 |
546657.00 |
Bikes |
Oct |
413 |
536590.50 |
Car racks |
Dec |
115 |
20137.05 |
Car racks |
Feb |
124 |
21763.30 |
Car racks |
Jan |
142 |
24794.75 |
. . . |
. . . |
. . . |
. . . |
Skateboards |
Nov |
203 |
89040.58 |
Skateboards |
Oct |
164 |
60530.06 |
Tires |
Dec |
150 |
4388.55 |
Tires |
Feb |
137 |
3937.70 |
Tires |
Jan |
186 |
5377.60 |
Tires |
Nov |
110 |
3081.24 |
Tires |
Oct |
151 |
4356.91 |
Should you want the data presented in a different sequence, the ORDER BY clause must appear after the last SELECT in the UNION query, as shown in Listing 3.11 on the next page.
Listing 3.11 Specifying the sort order of the UNION query
SELECT Category, 'Oct' AS SalesMonth, OctQuantity AS Quantity, OctSales AS SalesAmt FROM SalesSummary UNION SELECT Category, 'Nov', NovQuantity, NovSales FROM SalesSummary UNION SELECT Category, 'Dec', DecQuantity, DecSales FROM SalesSummary UNION SELECT Category, 'Jan', JanQuantity, JanSales FROM SalesSummary UNION SELECT Category, 'Feb', FebQuantity, FebSales FROM SalesSummary ORDER BY SalesMonth, Category;
Table 3.4 shows a partial extract returned by the query in Listing 3.11.
Table 3.4 Partial extract of data returned by the UNION query in Listing 3.11
Category |
SalesMonth |
Quantity |
SalesAmount |
Accessories |
Dec |
987 |
62758.14 |
Bikes |
Dec |
332 |
439831.50 |
Car racks |
Dec |
115 |
20137.05 |
Clothing |
Dec |
139 |
4937.74 |
Components |
Dec |
265 |
27480.22 |
Skateboards |
Dec |
129 |
59377.20 |
Tires |
Dec |
150 |
4388.55 |
Accessories |
Feb |
979 |
60242.47 |
Bikes |
Feb |
450 |
585130.50 |
Car racks |
Feb |
124 |
21763.30 |
. . . |
. . . |
. . . |
. . . |
Another consideration is that UNION queries eliminate any duplicate rows. Should this not be what you want, you can specify UNION ALL instead of UNION, and duplicates will not be eliminated. On the other hand, UNION ALL can provide performance improvements as it skips the step of deduplicating the result set, so if you know that the sources will not overlap, it can be advantageous to specify UNION ALL for those queries.
Things to Remember
Each of the SELECT statements in the UNION query must have the same number of columns.
Although the names of the columns in the various SELECT statements do not matter, the data types of each column must be compatible.
To control the order in which the data appears, you can use an ORDER BY clause after the last SELECT statement.
Use UNION ALL rather than UNION if you do not wish to eliminate duplicate rows or pay the performance penalty of deduplicating rows.