T-SQL Language Enhancements
Even though this book is much about the CLR and outside access to SQL Server, let's not forget that Microsoft has enhanced the T-SQL language a lot in SQL Server 2005. In this section, we will look at some of the improvements.
TOP
TOP was introduced in SQL Server 7. Until SQL Server 2005, the TOP clause allowed the user to specify the number or percent of rows to be returned in a SELECT statement. In SQL Server 2005, the TOP clause can be used also for INSERT, UPDATE, and DELETE (in addition to SELECT), and the syntax is as follows: TOP (expression) [PERCENT]. Notice the parentheses around the expression; this is required when TOP is used for UPDATE, INSERT, and DELETE.
The following code shows some examples of using TOP.
create a table and insert some data CREATE TABLE toptest (col1 VARCHAR(150)) INSERT INTO toptest VALUES('Niels1') INSERT INTO toptest VALUES('Niels2') INSERT INTO toptest VALUES('Niels3') INSERT INTO toptest VALUES('Niels4') INSERT INTO toptest VALUES('Niels5') this returns 'Niels1' and 'Niels2' SELECT TOP(2) * FROM toptest this sets 'Niels1' and 'Niels2' to 'hi' UPDATE TOP(2) toptest SET col1 = 'hi' SELECT * FROM toptest the two rows with 'hi' are deleted DELETE TOP(2) toptest SELECT * FROM toptest create a new table and insert some data CREATE TABLE toptest2 (col1 VARCHAR(150)) INSERT INTO toptest2 VALUES('Niels1') INSERT INTO toptest2 VALUES('Niels2') INSERT INTO toptest2 VALUES('Niels3') INSERT INTO toptest2 VALUES('Niels4') INSERT INTO toptest2 VALUES('Niels5') 'Niels1' and 'Niels2' are inserted INSERT top(2) toptest SELECT * FROM toptest2 SELECT * FROM toptest
An additional difference between the TOP clause in previous versions of SQL Server and in SQL Server 2005 is that we now can use expressions for number definition. The following code shows a couple of examples of that (it uses the tables from the preceding example).
declare 3 variables DECLARE @a INT DECLARE @b INT DECLARE @c INT set values SET @a = 10 SET @b = 5 SELECT @c = @a/@b use the calculated expression SELECT TOP(@c)* FROM toptest insert some more data in toptest INSERT INTO toptest VALUES('Niels6') INSERT INTO toptest VALUES('Niels7') INSERT INTO toptest VALUES('Niels8') use a SELECT statement as expression this should return 5 rows SELECT TOP(SELECT COUNT(*) FROM toptest2) * FROM toptest
The next T-SQL enhancement we'll look at is something completely new in SQL Server: the OUTPUT clause.
OUTPUT
The execution of a DML statement such as INSERT, UPDATE, or DELETE does not produce any results that indicate what was changed. Prior to SQL Server 2005, an extra round trip to the database was required to determine the changes. In SQL Server 2005 the INSERT, UPDATE, and DELETE statements have been enhanced to support an OUTPUT clause so that a single round trip is all that is required to modify the database and determine what changed. You use the OUTPUT clause together with the inserted and deleted virtual tables, much as in a trigger. The OUTPUT clause must be used with an INTO expression to fill a table. Typically, this will be a table variable. The following example creates a table, inserts some data, and finally deletes some records.
create table and insert data CREATE TABLE outputtbl (id INT IDENTITY, col1 VARCHAR(15)) go INSERT INTO outputtbl VALUES('row1') INSERT INTO outputtbl VALUES ('row2') INSERT INTO outputtbl VALUES ('row5') INSERT INTO outputtbl VALUES ('row6') INSERT INTO outputtbl VALUES ('row7') INSERT INTO outputtbl VALUES ('row8') INSERT INTO outputtbl VALUES ('row9') INSERT INTO outputtbl VALUES ('row10') make a table variable to hold the results of the OUTPUT clause DECLARE @del AS TABLE (deletedId INT, deletedValue VARCHAR(15)) delete two rows and return through the output clause DELETE outputtbl OUTPUT DELETED.id, DELETED.col1 INTO @del WHERE id < 3 SELECT * FROM @del GO deletedId deletedValue - - 1 row1 2 row2 (2 row(s) affected)
The previous example inserted the id and col1 values of the rows that were deleted into the table variable @del.
When used with an UPDATE command, OUTPUT produces both a DELETED and an INSERTED table. The DELETED table contains the values before the UPDATE command, and the DELETED table has the values after the UPDATE command. An example follows that shows OUTPUT being used to capture the result of an UPDATE.
update records, this populates both the inserted and deleted tables DECLARE @changes TABLE (id INT, oldValue VARCHAR(15), newValue VARCHAR(15)) UPDATE outputtbl SET col1 = 'updated' OUTPUT inserted.id, deleted.col1, inserted.col1 INTO @changes WHERE id < 5 SELECT * FROM @changes GO id oldValue newValue - - - 3 row5 updated 4 row6 updated (2 row(s) affected)
Common Table Expressions and Recursive Queries
A Common Table Expression, or CTE, is an expression that produces a table that is referred to by name within the context of a single query. The general syntax for a CTE follows.
[WITH <common_table_expression> [,...n] ] <common_table_expression>::= expression_name [(column_name [,...n])] AS (<CTE_query_expression>)
The following SQL batch shows a trivial usage of a CTE just to give you a feeling for its syntax.
WITH MathConst(PI, Avogadro) AS (SELECT 3.14159, 6.022e23) SELECT * FROM MathConst GO PI Avogadro 3.14159 6.022E+23 (1 row(s) affected)
The WITH clause, in effect, defines a table and its columns. This example says that a table named MathConst has two columns named PI and Avogadro. This is followed by a SELECT statement enclosed in parentheses after an AS keyword. And finally, all this is followed by a SELECT statement that references the MathConst table. Note that the syntax of the WITH clause is very similar to that of a VIEW. One way to think of a CTE is as a VIEW that lasts only for the life of the query expression at the end of the CTE. In the example, MathConst acts like a VIEW that is referenced in the query expression at the end of the CTE.
It is possible to define multiple tables in a CTE. A SQL batch follows that shows another trivial usage of a CTE that defines two tables, again shown just to make the syntax clear.
WITH MathConst(PI, Avogadro) AS (SELECT 3.14159, 6.022e23), second table Package(Length, Width) AS (SELECT 2, 5) SELECT * FROM MathConst, Package PI Avogadro Length Width - 3.14159 6.022E+23 2 5 (1 row(s) affected)
In this example, the CTE produced two tables, and the query expression merely joined them.
Both of the previous examples could have been done without using CTEs and, in fact, would have been easier to do without them. So what good are they?
In once sense, a CTE is just an alternate syntax for creating a VIEW that exists for one SQL expression, or it can be thought of as a more convenient way to use a derived tablethat is, a subquery. However, CTEs are part of the SQL-92 standard, so adding them to SQL Server increases its standards compliance. In addition, CTEs are implemented in other databases, so ports from those databases may be easier with the addition of CTEs.
In some cases, CTEs can save a significant amount of typing and may provide extra information that can be used when the query plan is optimized. Let's look at an example where this is the case.
For this example, we will use three tables from the AdventureWorks database, a sample database that is distributed with SQL Server. We will use the SalesPerson, SalesHeader, and SalesDetail tables. The Sales Person table lists each salesperson that works for AdventureWorks. For each sale made at AdventureWorks, a SalesHeader is entered along with a SalesDetail for each item that that was sold in that sale. Each Sales Header lists the ID of the salesperson who made the sale. Each Sales Detail entry lists a part number, its unit price, and the quantity of the part sold.
The stock room has just called the Big Boss and told him that they are out of part number 90. The Big Boss calls you and wants you to make a report that lists the ID of each salesperson. Along with the ID, the Big Boss wants the text "MakeCall" listed if a salesperson made a sale that depends on part number 90 to be complete. Otherwise, he wants the text "Relax" printed. Just to ensure that the report lights a fire under the salespeople, the Big Boss also wants each line to list the value of the sale and the salesperson's sales quota.
Before we actually make use of the CTE, let's first write a query that finds all the IDs of salespeople who have sales that depend on part number 90.
SELECT DISTINCT SH.SalesPersonId FROM SalesOrderHeader SH JOIN SalesOrderDetail SD ON SH.SalesOrderId = SD.SalesOrderId AND SD.ProductID = 90 SalesPersonId GO SalesPersonId - 14 21 22 more rows (14 row(s) affected)
But the Big Boss has asked for a report with lines that look like this.
Action SalesPersonID SalesQuota Value - MakeCall 22 250000.0000 2332.7784 ... more lines Relax 35 250000.0000 0
Each line number has the ID of a salesperson. If that salesperson has an order that depends on part number 90, the first column says "MakeCall" and the last column has the value involved in the order. Otherwise, the first column says "Relax" and the last column has 0 in it.
Figure 7-5: A Chart of Accounts
Without CTEs, we could use a subquery to find the salespeople with orders that depend on the missing part to make the report the Big Boss wants, as in the SQL batch that follows.
SELECT 'MakeCall' AS Action, S.SalesPersonID, S.SalesQuota, (SELECT SUM(SD.UnitPrice * SD.OrderQty) FROM SalesOrderHeader SH JOIN SalesOrderDetail SD ON SH.SalesOrderId = SD.SalesOrderId AND SD.ProductID=90 AND SH.SalesPersonID=S.SalesPersonID ) FROM SalesPerson S WHERE EXISTS ( SELECT * FROM SalesOrderHeader SH JOIN SalesOrderDetail SD ON SH.SalesOrderID = SD.SalesOrderID AND SD.ProductID = 90 AND SH.SalesPersonID = S.SalesPersonID ) UNION SELECT 'Relax' AS Action, S.SalesPersonID, S.SalesQuota, 0 FROM SalesPerson S WHERE NOT EXISTS ( SELECT * FROM SalesOrderHeader SH JOIN SalesOrderDetail SD ON SH.SalesOrderID = SD.SalesOrderID AND SD.ProductID = 90 AND SH.SalesPersonID = S.SalesPersonID )
Notice that the subquery is reused in a number of placesonce in the calculation of the value of the sales involved in the missing part and then again, twice more, in finding the salespeople involved in sales with and without the missing part.
Now let's produce the same report using a CTE.
WITH Missing(SP, AMT) AS( SELECT SH.SalesPersonID, SUM(SD.UnitPrice * SD.OrderQty) FROM SalesOrderHeader SH JOIN SalesOrderDetail SD ON SH.SalesOrderId = SD.SalesOrderId AND SD.ProductID=90 GROUP BY SH.SalesPersonID ) SELECT 'MakeCall' AS Action, S.SalesPersonID, S.SalesQuota, Missing.AMT FROM Missing JOIN SalesPerson S ON Missing.SP = S.SalesPersonID UNION SELECT 'Relax' AS Action, S.SalesPersonID, S.SalesQuota, 0 FROM SalesPerson S WHERE S.SalesPersonID NOT IN (SELECT SP FROM Missing)
The Missing CTE is a table that has a row for each salesperson who has an order that depends on the missing part, and the value of what is missing. Notice that the Missing table is used in one part of the query to find the value of the missing parts and in another to determine whether a sales person should "MakeCall" or "Relax".
Although your opinion may differ, the CTE syntax is a bit clear and more encapsulated; that is, there is only one place that defines what orders are missing part number 90. Also, in theory, the CTE is giving the optimizer a bit more information in that it is telling the optimizer it plans on using Missing more than once.
The CTE is also part of another feature of SQL Server 2005 that is also part of the SQL:1999 standard. It is called a recursive query. This is especially useful for a chart of accounts in an accounting system or a parts explosion in a bill of materials. Both of these involve tree-structured data. In general, a recursive query is useful anytime tree-structured data is involved. We will look at an example of a chart of accounts to see how recursive queries work.
Figure 7-5 shows a simple chart of accounts containing two kinds of accounts: detail accounts and rollup accounts. Detail accounts have an actual balance associated with them; when a posting is made to an accounting system, it is posted to detail accounts. In Figure 7-5, account 4001 is a detail account that has a balance of $12.
Rollup accounts are used to summarize the totals of other accounts, which may be detail accounts or other rollup accounts. Every account, except for the root account, has a parent. The total of a rollup account is the sum of the accounts that are its children. In Figure 7-5 account 3002 is a rollup account, and it represents the sum of its two children, accounts 4001 and 4002.
In practice, one of the ways to represent a chart of accounts is to have two tables: one for detail accounts and the other for rollup accounts. A detail account has an account number, a parent account number, and a balance for columns. A rollup account has an account number and a parent but no balance associated with it. The SQL batch that follows builds and populates these two tables for the accounts shown in Figure 7-5.
CREATE TABLE DetailAccount(id INT PRIMARY KEY, parent INT, balance FLOAT) CREATE TABLE RollupAccount(id INT PRIMARY KEY, parent INT) INSERT INTO DetailAccount VALUES (3001, 2001, 10) INSERT INTO DetailAccount VALUES(4001, 3002, 12) INSERT INTO DetailAccount VALUES(4002, 3002, 14) INSERT INTO DetailAccount VALUES(3004, 2002, 17) INSERT INTO DetailAccount VALUES(3005, 2002, 10) INSERT INTO DetailAccount VALUES(3006, 2002, 25) INSERT INTO DetailAccount VALUES(3007, 2003, 7) INSERT INTO DetailAccount VALUES(3008, 2003, 9) INSERT INTO RollupAccount VALUES(3002, 2001) INSERT INTO RollupAccount VALUES(2001, 1000) INSERT INTO RollupAccount VALUES(2002, 1000) INSERT INTO RollupAccount VALUES(2003, 1000) INSERT INTO RollupAccount VALUES(1000, 0)
Figure 7-6: Recursive Query
Note that this example does not include any referential integrity constraints or other information to make it easier to follow.
A typical thing to do with a chart of accounts it to calculate the value of all the rollup accounts or, in some cases, the value of a particular rollup account. In Figure 7-5 (shown earlier) the value of the rollup accounts is shown in gray, next to the account itself. We would like to be able to write a SQL batch like the one that follows.
SELECT id, balance FROM Rollup a handy view id balance - 1000 104 2001 36 2002 52 2003 16 3001 10 3002 26 3004 17 3005 10 3006 25 3007 7 3008 9 4001 12 4002 14 (13 row(s) affected) SELECT id, balance FROM Rollup WHERE id = 2001 id balance - 2001 36 (1 row(s) affected)
This query shows a view name, Rollup, that we can query to find the values of all the accounts in the chart of accounts or an individual account. Let's look at how we can do this.
To start with, we will make a recursive query that just lists all the account numbers, starting with the top rollup account, 1000. The query that follows does this.
WITH Rollup(id, parent) AS ( anchor SELECT id, parent FROM RollupAccount WHERE id = 1000 UNION ALL recursive call SELECT R1.id, R1.parent FROM ( SELECT id, parent FROM DetailAccount UNION ALL SELECT id, parent FROM RollupAccount ) R1 JOIN Rollup R2 ON R2.id = r1.parent ) selecting results SELECT id, parent FROM Rollup GO id parent - - 1000 0 2001 1000 2002 1000 2003 1000 3007 2003 3008 2003 3004 2002 3005 2002 3006 2002 3001 2001 3002 2001 4001 3002 4002 3002 (13 row(s) affected)
The previous batch creates a CTE named Rollup. There are three parts to a CTE when it is used to do recursion. The anchor, which initializes the recursion, is first. It sets the initial values of Rollup. In this case, Rollup is initialized to a table that has a single row representing the rollup account with id = 1000. The anchor may not make reference to the CTE Rollup.
The recursive call follows a UNION ALL keyword. UNION ALL must be used in this case. It makes reference to the CTE Rollup. The recursive call will be executed repeatedly until it produces no results. Each time it is called, Rollup will be the results of the previous call. Figure 7-6 shows the results of the anchor and each recursive call.
First the anchor is run, and it produces a result set that includes only the account 1000. Next the recursive call is run and produces a resultset that consists of all the accounts that have as a parent account 1000. The recursive call runs repeatedly, each time joined with its own previous result to produce the children of the accounts selected in the previous recursion. Also note that the recursive call itself is a UNION ALL because the accounts are spread out between the DetailAccount table and the RollupAccount table.
After the body of the CTE, the SELECT statement just selects all the results in Rollupthat is, the UNION of all the results produced by calls in the CTE body.
Now that we can produce a list of all the accounts by walking through the hierarchy from top to bottom, we can use what we learned to calculate the value of each account.
To calculate the values of the accounts, we must work from the bottom upthat is from the detail accounts up to the rollup account 1000. This means that our anchor must select all the detail accounts, and the recursive calls must progressively walk up the hierarchy to account 1000. Note that there is no requirement that the anchor produce a single row; it is just a SELECT statement.
The query that follows produces the values of all the accounts, both detail and rollup.
WITH Rollup(id, parent, balance) AS ( anchor SELECT id, parent, balance FROM DetailAccount UNION ALL recursive call SELECT R1.id, R1.parent, R2.balance FROM RollupAccount R1 JOIN Rollup R2 ON R1.id = R2.parent ) SELECT id, SUM(balance) balance FROM Rollup GROUP BY id GO id balance - 1000 104 2001 36 2002 52 2003 16 3001 10 3002 26 3004 17 3005 10 3006 25 3007 7 3008 9 4001 12 4002 14 (13 row(s) affected)
This query starts by having the anchor select all the detail accounts. The recursive call selects all the accounts that are parents, along with any balance produced by the previous call. This results in a table in which accounts are listed more than once. In fact, the table has as many rows for an account as that account has descendant accounts that are detail accounts. For example, if you looked at the rows produced for account 2001, you would see the three rows shown in the following diagram.
id balance - 2001 14 2001 12 2001 10
The balances 14, 12, and 10 correspond to the balances in the detail accounts 3001, 4001, and 4002, which are all decedents of account 2001. The query that follows the body of the CTE then groups the rows that are produced by account ID and calculates the balance with the SUM function.
There are other ways to solve this problem without using CTEs. A batch that uses a stored procedure that calls itself or a cursor could produce the same result. However, the CTE is a query, and it can be used to define a view, something a stored procedure or a cursor-based batch cannot. The view definition that follows defines a view, which is the recursive query we used earlier, and then uses it to get the balance for a single account, account 2001.
CREATE VIEW Rollup AS WITH Rollup(id, parent, balance) AS ( SELECT id, parent, balance FROM DetailAccount UNION ALL SELECT R1.id, R1.parent, R2.balance FROM RollupAccount R1 JOIN Rollup R2 ON R1.id = R2.parent ) SELECT id, SUM(balance) balance FROM Rollup GROUP ID id GO get the balance for account 2001 SELECT balance FROM rollup WHERE id = 2001 GO balance 36 (1 row(s) affected)
One of the strengths of a recursive query is the fact that it is a query and can be used to define a view. In addition, a single query in SQL Server is always a transaction, which means that a recursive query based on a CTE is a transaction.
Recursive queries, like any recursive algorithm, can go on forever. By default, if a recursive query attempts to do more than 100 recursions, it will be aborted. You can control this with an OPTION(MAXRECURSION 10), for example, to limit recursion to a depth of 10. The example that follows shows its usage.
WITH Rollup(id, parent, balance) AS ( body of CTE removed for clarity ) SELECT id, SUM(balance) balance FROM Rollup GROUP BY id OPTION (MAXRECURSION 10) GO
APPLY Operators
T-SQL adds two specialized join operators: CROSS APPLY and OUTER APPLY. Both act like a JOIN operator in that they produce the Cartesian product of two tables except that no ON clause is allowed. The following SQL batch is an example of a CROSS APPLY between two tables.
CREATE TABLE T1 ( ID int ) CREATE TABLE T2 ( ID it ) GO INSERT INTO T1 VALUES (1) INSERT INTO T1 VALUES (2) INSERT INTO T2 VALUES (3) INSERT INTO T2 VALUES (4) GO SELECT COUNT(*) FROM T1 CROSS APPLY T2 - 4
The APPLY operators have little utility with just tables or views; a CROSS JOIN could have been substituted in the preceding example and gotten the same results. It is intended that the APPLY operators be used with a table-valued function on their right, with the parameters for the table-valued function coming from the table on the left. The following SQL batch shows an example of this.
Table 7-5: Individual Sales, Including Quarter of Sale
Year |
Quarter |
Amount |
2001 |
Q1 |
100 |
2001 |
Q2 |
120 |
2001 |
Q2 |
70 |
2001 |
Q3 |
55 |
2001 |
Q3 |
110 |
2001 |
Q4 |
90 |
2002 |
Q1 |
200 |
2002 |
Q2 |
150 |
2002 |
Q2 |
40 |
2002 |
Q2 |
60 |
2002 |
Q3 |
120 |
2002 |
Q3 |
110 |
2002 |
Q4 |
180 |
CREATE TABLE Belt ( model VARCHAR(20), length FLOAT ) GO fill table with some data DECLARE @index INT SET @index = 5 WHILE(@index > 0) BEGIN INSERT INTO BELT VALUES ('B' + CONVERT(VARCHAR, @index), 10 * @index) SET @index = @index 1 END GO make a table-valued function CREATE FUNCTION Stretch (@length FLOAT) RETURN @T TABLE ( MinLength FLOAT, MaxLength FLOAT ) AS BEGIN IF (@length > 20) INSERT @T VALUES (@length 4, @length + 5) RETURN END GO SELECT B.* S.MinLength, S.MaxLength FROM Belt AS B CROSS APPLY Stretch(B.Length) AS S GO B30, 26, 35 B40, 36, 45 B50, 46, 55
Table 7-6: Yearly Sales Broken Down by Quarter
Year |
Q1 |
Q2 |
Q3 |
Q4 |
2001 |
100 |
190 |
165 |
90 |
2002 |
200 |
250 |
230 |
180 |
Figure 7-7: Tables for Hardware Store
The rows in the Belt table are cross-applied to the Stretch function. This function produces a table with a single row in it if the @length parameter passed into it is greater than 20; otherwise, it produces a table with no rows in it. The CROSS APPLY operator produces output when each table involved in the CROSS APPLY has at least one row in it. It is similar to a CROSS JOIN in this respect.
OUTER APPLY is similar to OUTER JOIN in that it produces output for all rows involved in the OUTER APPLY. The following SQL batch shows the results of an OUTER APPLY involving the same Belt table and Stretch function as in the previous example.
SELECT B.* S.MinLength, S.MaxLength FROM Belt AS B CROSS APPLY Stretch(B.Length) AS S GO B10, 6, 15 B20, 16, 25 B30, 26, 35 B40, 36, 45 B50, 46, 55
The preceding example could have been done using CROSS and OUTER JOIN. CROSS APPLY is required, however, when used in conjunction with XML data types in certain XML operations that will be discussed in Chapter 9.
PIVOT Command
SQL Server 2005 adds the PIVOT command to T-SQL, so named because it can create a new table by swapping the rows and columns of an existing table. PIVOT is part of the OLAP section of the SQL:1999 standard. There are two general uses for the PIVOT command. One it to create an analytical view of some data, and the other is to implement an open schema.
A typical analytical use of the PIVOT command is to covert temporal data into categorized data in order to make the data easier to analyze. Consider a table used to record each sale made as it occurs; each row represents a single sale and includes the quarter that indicates when it occurred. This sort of view makes sense for recording sales but is not easy to use if you want to compare sales made in the same quarter, year over year.
Table 7-5 lists temporally recorded sales. You want to analyze same-quarter sales year by year from the data in the table. Each row represents a single sale. Note that this table might be a view of a more general table of individual sales that includes a date rather than a quarterly enumeration.
The PIVOT command, which we will look at shortly, can convert this temporal view of individual sales into a view that has years categorized by sales in a quarter. Table 7-6 shows this.
Presenting the data this way makes it much easier to analyze same-quarter sales. This table aggregates year rows for each given year in the previous table into a single row. However, the aggregated amounts are broken out into quarters rather than being aggregated over the entire year.
The other use of the PIVOT command is to implement an open schema. An open schema allows arbitrary attributes to be associated with an entity. For example, consider a hardware store; its entities are the products that it sells. Each product has a number of attributes used to describe it. One common attribute of all products it the name of the product.
The hardware store sells "Swish" brand paint that has attributes of quantity, color, and type. It also sells "AttachIt" fastener screws, and these have attributes of pitch and diameter. Over time, it expects to add many other products to its inventory. With this categorization "Swish, 1 qt, green, latex" would be one product or entity, and "Swish, 1qt, blue, oil" would be another.
A classic solution to designing the database the hardware store will use to maintain its inventory is to design a table per product. For example, a table named Swish with columns for quantity, color, and type. This, of course, requires products and their attributes to be known and for those attributes to remain constant over time. What happens if the manufacturer of the Swish paint adds a new attribute, "Drying Time", but only to certain colors of paint?
An alternate solution is to have only two tables, regardless of the number of products involved or the attributes they have. In the case of the hardware store, there would be a Product table and a Properties table. The Product table would have an entry per product, and the Properties table would contain the arbitrary attributes of that product. The properties of a product are linked to it via a foreign key. This is called an open schema. Figure 7-7 shows the two ways of designing tables to represent the inventory of the hardware store.
The PIVOT operator can easily convert data that is stored using an open schema to a view that looks the same as the table-per-product solution. Next, we will look at the details of using PIVOT to analyze data and support open schemas, and then how to use PIVOT to work with open schemas. There is also an UNPIVOT operator, which can be used to produce the original open schema format from previously pivoted results.
Using PIVOT for Analysis
In this example, we are going to use PIVOT to analyze the sales data we showed in an earlier table. To do this, we build a SALES table and populate it with data, as is shown in the following SQL batch.
CREATE TABLE SALES ( [Year] INT, Quarter CHAR(2), Amount FLOAT ) GO INSERT INTO SALES VALUES (2001, 'Q2', 70) INSERT INTO SALES VALUES (2001, 'Q3', 55) INSERT INTO SALES VALUES (2001, 'Q3', 110) INSERT INTO SALES VALUES (2001, 'Q4', 90) INSERT INTO SALES VALUES (2002, 'Q1', 200) INSERT INTO SALES VALUES (2002, 'Q2', 150) INSERT INTO SALES VALUES (2002, 'Q2', 40) INSERT INTO SALES VALUES (2002, 'Q2', 60) INSERT INTO SALES VALUES (2002, 'Q3', 120) INSERT INTO SALES VALUES (2002, 'Q3', 110) INSERT INTO SALES VALUES (2002, 'Q4', 180) GO
To get a view that is useful for quarter-over-year comparisons, we want to pivot the table's Quarter column into a row heading and aggregate the sum of the values in each quarter for a year. The SQL batch that follows shows a PIVOT command that does this.
SELECT * FROM SALES PIVOT (SUM (Amount) Aggregate the Amount column using SUM FOR [Quarter] Pivot the Quarter column into column headings IN (Q1, Q2, Q3, Q4)) use these quarters AS P GO Year Q1 Q2 Q3 Q4 - - - - 2001 100 190 165 90 2002 200 250 230 180
The SELECT statement selects all the rows from SALES. A PIVOT clause is added to the SELECT statement. It starts with the PIVOT keyword followed by its body enclosed in parentheses. The body contains two parts separated by the FOR keyword. The first part of the body specifies the kind of aggregation to be performed. The argument of the aggregate function must be a column name; it cannot be an expression as it is when an aggregate function is used outside a PIVOT. The second part specifies the pivot columnthat is, the column to pivot into a rowand the values from that column to be used as column headings. The value for a particular column in a row is the aggregation of the column specified in the first part, over the rows that match the column heading.
Note that it is not required to use all the possible values of the pivot column. You only need to specify the Q2 column if you wish to analyze just the year-over-year Q2 results. The SQL batch that follows shows this.
SELECT * FROM SALES PIVOT (SUM (Amount) FOR [Quarter] IN (Q2)) AS P GO Year Q2 - 2001 190 2002 250
Figure 7-8: Rotating Properties
Figure 7-9: Basic PIVOT
Figure 7-10: Results of Open Schema Pivot
Note that the output produced by the PIVOT clause acts as though SELECT has a GROUP BY [Year] clause. A pivot, in effect, applies a GROUP BY to the SELECT that includes all the columns that are not either the aggregate or the pivot column. This can lead to undesired results, as shown in the SQL batch that follows. It uses essentially the same SALES table as the previous example, except that it has an additional column named Other.
CREATE TABLE SALES2 ( [Year] INT, Quarter CHAR(2), Amount FLOAT, Other INT ) INSERT INTO SALES2 VALUES (2001, 'Q2', 70, 1) INSERT INTO SALES2 VALUES (2001, 'Q3', 55, 1) INSERT INTO SALES2 VALUES (2001, 'Q3', 110, 2) INSERT INTO SALES2 VALUES (2001, 'Q4', 90, 1) INSERT INTO SALES2 VALUES (2002, 'Q1', 200, 1) INSERT INTO SALES2 VALUES (2002, 'Q2', 150, 1) INSERT INTO SALES2 VALUES (2002, 'Q2', 40, 1) INSERT INTO SALES2 VALUES (2002, 'Q2', 60, 1) INSERT INTO SALES2 VALUES (2002, 'Q3', 120, 1) INSERT INTO SALES2 VALUES (2002, 'Q3', 110, 1) INSERT INTO SALES2 VALUES (2002, 'Q4', 180, 1) SELECT * FROM Sales2 PIVOT (SUM (Amount) FOR Quarter IN (Q3)) AS P GO Year Other Q3 - - 2001 1 55 2002 1 115 2001 2 110
Note that the year 2001 appears twice, once for each value of Other. The SELECT that precedes the PIVOT keyword cannot specify which columns to use in the PIVOT clause. However, a subquery can be used to eliminate the columns not desired in the pivot, as shown in the SQL batch that follows.
SELECT * FROM (Select Amount, Quarter, Year from Sales2 ) AS A PIVOT (SUM (Amount) FOR Quarter IN (Q3)) AS P GO Year Q3 - 2001 165 2002 230
A column named in the FOR part of the PIVOT clause may not correspond to any values in the pivot column of the table. The column will be output, but will have null values. The following SQL batch shows this.
SELECT * FROM SALES PIVOT (SUM (Amount) FOR [Quarter] IN (Q2, LastQ)) As P GO Year Q2 LastQ - 2001 190 NULL 2002 250 NULL
Note that the Quarter column of the SALES table has no value "LastQ", so the output of the PIVOT lists all the values in the LastQ column as NULL.
Using PIVOT for Open Schemas
Using PIVOT for an open schema is really no different from using PIVOT for analysis, except that we don't depend on PIVOT's ability to aggregate a result. The open schema has two tables, a Product table and a Properties table, as was shown in Figure 7-7. What we want to do is to take selected rows from the Properties table and pivot themthat is, rotate themand then add them as columns to the Product table. This is shown in Figure 7-8.
Figure 7-9 shows the PIVOT we will use to select the line from the Product table for "Swish" products and joint them with the corresponding pivoted lines from the Properties table.
This query selects row from the Properties table that have a string equal to "color", "type", or "amount" in the value column. They are selected from the value column because value is the argument of the MAX function that follows the PIVOT keyword. The strings "color", "type", and "amount" are used because they are specified as an argument of the IN clause after the FOR keyword. Note that the arguments of the IN clause must be literal; there is no way to calculate themfor example, by using a subquery.
The results of the pivot query in Figure 7-9 are shown in Figure 7-10.
Note that the columns that were selected from the Properties table now appear as rows in the output.
Ranking and Windowing Functions
SQL Server 2005 adds support for a group of functions known as ranking functions. At its simplest, ranking adds an extra value column to the resultset that is based on a ranking algorithm being applied to a column of the result. Four ranking functions are supported.
ROW_NUMBER() produces a column that contains a number that corresponds to the row's order in the set. The set must be ordered by using an OVER clause with an ORDER BY clause as a variable. The following is an example.
SELECT orderid, customerid, ROW_NUMBER() OVER(ORDER BY orderid) AS num FROM orders WHERE orderid < 10400 AND customerid <= 'BN' produces orderid customerid num - 10248 VINET 1 10249 TOMSP 2 10250 HANAR 3 10251 VICTE 4 10252 SUPRD 5 10253 HANAR 6 10254 CHOPS 7 10255 RICSU 8 ... more rows
Note that if you apply the ROW_NUMBER function to a nonunique column, such as customerid in the preceding example, the order of customers with the same customerid (ties) is not defined. In any case, ROW_NUMBER produces a monotonically increasing number; that is, no rows will ever share a ROW_NUMBER.
SELECT orderid, customerid, ROW_NUMBER() OVER(ORDER BY customerid) AS num FROM orders WHERE orderid < 10400 AND customerid <= 'BN' produces orderid customerid num - 10308 ANATR 1 10365 ANTON 2 10355 AROUT 3 10383 AROUT 4 10384 BERGS 5 10278 BERGS 6 10280 BERGS 7 10265 BLONP 8 10297 BLONP 9 10360 BLONP 10
RANK() applies a monotonically increasing number for each value in the set. The value of ties, however, is the same. If the columns in the OVER(ORDER BY ) clause have unique values, the result produced by RANK() is identical to the result produced by ROW_NUMBER(). RANK() and ROW_NUMBER() differ only if there are ties. Here's the second earlier example using RANK().
SELECT orderid, customerid, RANK() OVER(ORDER BY customerid) AS [rank] FROM orders WHERE orderid < 10400 AND customerid <= 'BN' produces orderid customerid rank - 10308 ANATR 1 10365 ANTON 2 10355 AROUT 3 10383 AROUT 3 10384 BERGS 5 10278 BERGS 5 10280 BERGS 5 10265 BLONP 8 10297 BLONP 8 10360 BLONP 8 ... more rows
Note that multiple rows have the same rank if their customerid is the same. There are holes, however, in the rank column value to reflect the ties. Using the DENSE_RANK() function works the same way as RANK() but gets rid of the holes in the numbering. NTILE(n) divides the resultset into "n" approximately even pieces and assigns each piece the same number. NTILE(100) would be the well-known (to students) percentile. The following query shows the difference between ROW_NUMBER(), RANK(), DENSE_RANK(), and TILE(n).
SELECT orderid, customerid, ROW_NUMBER() OVER(ORDER BY customerid) AS num, RANK() OVER(ORDER BY customerid) AS [rank], DENSE_RANK() OVER(ORDER BY customerid) AS [denserank], NTILE(5) OVER(ORDER BY customerid) AS ntile5 FROM orders WHERE orderid < 10400 AND customerid <= 'BN' produces orderid customerid num rank denserank ntile5 - - 10308 ANATR 1 1 1 1 10365 ANTON 2 2 2 1 10355 AROUT 3 3 3 2 10383 AROUT 4 3 3 2 10278 BERGS 5 5 4 3 10280 BERGS 6 5 4 3 10384 BERGS 7 5 4 4 10265 BLONP 8 8 5 4 10297 BLONP 9 8 5 5 10360 BLONP 10 8 5 5
The ranking functions have additional functionality when combined with windowing functions. Windowing functions divide a resultset into partitions, based on the value of a PARTITION BY clause inside the OVER clause. The ranking functions are applied separately to each partition. Here's an example.
SELECT *, RANK() OVER(PARTITION BY COUNTRY ORDER BY age) AS [rank] FROM ( SELECT lastname, country, DATEDIFF(yy,birthdate,getdate())AS age FROM employees ) AS a produces lastname country age rank - - Dodsworth UK 37 1 Suyama UK 40 2 King UK 43 3 Buchanan UK 48 4 Leverling USA 40 1 Callahan USA 45 2 Fuller USA 51 3 Davolio USA 55 4 Peacock USA 66 5
There are separate rankings for each partition. An interesting thing to note about this example is that the subselect is required because any column used in a PARTITION BY or ORDER BY clause must be available from the columns in the FROM portion of the statement. In our case, the seemingly simpler statement that follows:
SELECT lastname, country, DATEDIFF(yy,birthdate,getdate())AS age, RANK() OVER(PARTITION BY COUNTRY ORDER BY age) AS [rank] FROM employees
wouldn't work; instead, you'd get the error "Invalid column name 'age'". In addition, you can't use the ranking column in a WHERE clause, because it is evaluated after all the rows are selected, as shown next.
10 rows to a page, we want page 40 this won't work SELECT ROW_NUMBER() OVER (ORDER BY customerid, requireddate) AS num, customerid, requireddate, orderid FROM orders WHERE num BETWEEN 400 AND 410 this will SELECT * FROM ( SELECT ROW_NUMBER() OVER (ORDER BY customerid, requireddate) AS num, customerid, requireddate, orderid FROM orders ) AS a WHERE num BETWEEN 400 AND 410
Although the preceding case looks similar to selecting the entire resultset into a temporary table, with num as a derived identity column, and doing a SELECT of the temporary table, in some cases the engine will be able to accomplish this without the complete set of rows. Besides being usable in a SELECT clause, the ranking and windowing functions are also usable in the ORDER BY clause. This gets employees partitioned by country and ranked by age, and then sorted by rank.
SELECT *, RANK() OVER(PARTITION BY COUNTRY ORDER BY age)) AS [rank] FROM ( SELECT lastname, country, DATEDIFF(yy,birthdate,getdate())AS age FROM employees ) AS a ORDER BY RANK() OVER(PARTITION BY COUNTRY ORDER BY age), COUNTRY produces lastname country age rank - - Dodsworth UK 37 1 Leverling USA 40 1 Suyama UK 40 2 Callahan USA 45 2 King UK 43 3 Fuller USA 51 3 Buchanan UK 48 4 Davolio USA 55 4 Peacock USA 66 5
You can also use other aggregate functions (either system-defined aggregates or user-defined aggregates that you saw in Chapter 5) with the OVER clause. When it is used in concert with the partitioning functions, however, you get the same value for each partition. This is shown next.
there is one oldest employee age for each country SELECT *, RANK() OVER(PARTITION BY COUNTRY ORDER BY age) AS [rank], MAX(age) OVER(PARTITION BY COUNTRY) AS [oldest age in country] FROM ( SELECT lastname, country, DATEDIFF(yy,birthdate,getdate())AS age FROM employees ) AS a produces lastname country age rank oldest age in country - - - Dodsworth UK 37 1 48 Suyama UK 40 2 48 King UK 43 3 48 Buchanan UK 48 4 48 Leverling USA 40 1 66 Callahan USA 45 2 66 Fuller USA 51 3 66 Davolio USA 55 4 66 Peacock USA 66 5 66