SQL Performance Tuning: Simple "Searches"
In this chapter, we'll talk about syntax-based optimizing and simple search conditions.
A syntax is a choice of words and their arrangement in the SQL statement. To optimize based on syntax, assume that nonsyntactic factors (e.g., indexes, table sizes, storage) are irrelevant or unknown. This is the lowest level of optimizingit's usually predictable, and some of it can be done on the client.
There's no point in attempting to optimize most SQL syntax because only certain SQL statements have options that lend themselves to optimization. The particular syntax that offers many optimization possibilities is the SQL search condition. Here are three examples of search conditions:
... WHERE title LIKE 'The %' OR title LIKE 'A %' ... WHERE name <> 'Smith' ... WHERE number = 5
Although the slowest search conditions are those that contain joins and subqueries, this chapter deals only with single-table searches. (We'll talk about joins and subqueries later.) Also, although search conditions can appear in HAVING, IF, or ON clauses, we'll talk only about search conditions in WHERE clauses. So the chapter title"Simple Searches"is an understatement. Don't worrythe complex queries will come later.
General Tuning
In this part of the chapter, we'll look at some general ideas you should keep in mind when writing simple search conditions.
Code for Points
The best search conditions are those that work on few rows with easy-to-do comparisons. Table 2-1 and Table 2-2 show typical lists (derived from vendors' manuals) of types of search conditions, in order from best to worst. Each search condition component has a "point count"the better the component, the higher the score. You can see from the allotted points shown in Tables 2-1 and 2-2 that the best search condition is something like:
... WHERE smallint_column = 12345
Table 2-1. Search Condition Point Counts for Operators
Operator |
Points |
= |
10 |
> |
5 |
>= |
5 |
< |
5 |
<= |
5 |
LIKE |
3 |
<> |
0 |
Table 2-2. Search Condition Point Counts for Operands
Operand |
Points |
Literal alone |
10 |
Column alone |
5 |
Parameter alone |
5 |
Multioperand expression |
3 |
Exact numeric data type |
2 |
Other numeric data type |
1 |
Temporal data type |
1 |
Character data type |
0 |
NULL |
0 |
This example gets a total of 27 points, calculated as follows:
Five points for the column (smallint_column) alone on the left
Two points for the exact numeric (smallint_column) operand data type
Ten points for the equals operator
Ten points for the literal (12345) alone on the right
Here's another example:
... WHERE char_column >= varchar_column || 'x'
The point count for this type of search condition is much loweronly 13:
Five points for the column (char_column) alone on the left
Zero points for the CHAR (char_column) operand data type
Five points for the greater than or equal to operator
Three points for the multioperand expression (varchar_column || 'x') on the right
Zero points for the VARCHAR (varchar_column) operand data type
The precise point count for a search condition varies from vendor to vendor, so it's pointless to memorize anything other than the order and the concept for this optimization technique. So just remember:
The condition that takes the least timeusually because it involves fewer rows or easier comparisonsgets the most points.
Armed with this concept, you can decide whether to change the order of expressions, or to substitute one expression for another that does the same work. Even though a modern cost-based DBMS optimizer has many more rules that require information outside the SQL statement itself, all DBMSs still fall back on the point count when no other information is available. The possibility is always there that you will use an item of information that the optimizer doesn't. (For more information about cost-based optimizers, see Chapter 17 "Cost-Based Optimizers.")
Another way you can optimize a search condition is by putting multiple expressions in the correct order. The expressions in this WHERE clause are already in optimal order:
SELECT * FROM Table1 WHERE column1 = 5 AND column2 = 77.3 AND column3 = 'Smith' AND column4 < 117 AND column4 > column5 GAIN: 0/8
The note at the bottom of this example says there is a GAIN: 0/8. That's an important number, and we're going to say "GAIN: x/8" in many of our examples, so let's clarify. As explained in Chapter 1, the gain shows how many of the Big Eight run faster when the search condition is optimized. Mileage varies with different data and different machines, of course. We're only reporting what our tests showed.
So "GAIN: 0/8" means "you'd be wasting your time if you rearranged this particular WHERE clause into optimum order because the DBMS does this for you." All DBMS makers know the basics of point counting, so the rearrangement is automatic. This means that, in ordinary cases, you will gain nothing by doing your own syntax-based optimization. However, there are many exceptions to the rule. For the rest of this chapter, we'll look at cases where the gain is both significant and predictable.
Constant Propagation
"All men are mortal. Socrates is a man. Therefore, Socrates is mortal."
Attributed to Aristotle
Formally, the Law of Transitivity states that:
IF (A <comparison operator> B) IS TRUE AND (B <comparison operator> C) IS TRUE THEN (A <comparison operator> C) IS TRUE AND NOT (A <comparison operator> C) IS FALSE
(Comparison operator is any one of: = or > or >= or < or <= but not one of: <> or LIKE.)
The Law of Transitivity leads to the simple observation that we can substitute C for B without changing the meaning of an expression. When such a substitution involves substituting a constant value, the process is called constant propagation.1
The next two expressions mean the same thing, but the second expression has a better point count because it substitutes a literal (5) for a column name (column1):
Expression #1 ... WHERE column1 < column2 AND column2 = column3 AND column1 = 5 Expression #2 ... WHERE 5 < column2 AND column2 = column3 AND column1 = 5 GAIN: 2/8
Expression #2 is called a transform of Expression #1. (Writing a transform means rewriting an SQL statement to produce the same result but with different syntax. When two SQL statements have different syntax, but will predictably and regularly produce the same outputs, they are known as transforms of one another.) Most good DBMSs do this sort of thing automatically. But some DBMSs won't try transforms when the expression contains multiple parentheses and NOTs. For example, this SELECT statement can be slow:
SELECT * FROM Table1 WHERE column1 = 5 AND NOT (column3 = 7 OR column1 = column2)
Applying the transforms ourselves, we came up with this statement:
SELECT * FROM Table1 WHERE column1 = 5 AND column3 <> 7 AND column2 <> 5 GAIN: 5/8
The transformed statement is faster more than half of the time. In other words, sometimes it pays to do your own transforms.
Sometimes constant propagation won't work with floats, because it's possible to be both "greater than" and "equal to" at the same time when approximate numeric comparisons happen. When it does work, though, expect a GAIN: 5/8. And sometimes, constant propagation won't work for CHAR expressions. But when it does, expect a GAIN: 4/8.
Exercise time: The MySQL online documentation has this example:
... WHERE a < b AND b = c AND a = 5
transforms to:
... WHERE b > 5 AND b = c AND a = 5
The quiz question here isDid the MySQL folks make a mistake?2
In the real world, you'll find many semiconstant operands, as program parameters or functions. Examples are the niladic functions like CURRENT_DATE (a niladic function is a function that has no arguments). Because using a constant value always accelerates accesses, try a transform to speed up these cases. Here's an example: Query #1 transforms to Query #2:
Query #1: SELECT * FROM Table1 WHERE date_column = CURRENT_DATE AND amount * 5 > 100.00 Query #2: SELECT * FROM Table1 WHERE date_column = DATE '2002-01-01' AND amount * 5 > 100.00 GAIN: 5/8
If you're thinking of transforming this type of expression, keep in mind that (because of the DATE constant), you'd have to change the query every day. That's only practical when an application program is generating the queries on the server.
Dead Code Elimination
"Branches with no leaves should be cut off."
Ortho Guide To Pruning Bushes and Shrubs
In some old SQL programs, you'll encounter literals on both sides of the comparison operator, as in this example:
SELECT * FROM Table1 WHERE 0 = 1 AND column1 = 'I hope we never execute this'
In the days before C-style /* comments */ were legal in an SQL statement, this was a way to add an inline comment string. Because the expression 0 = 1 is always false, this query will always return zero rows, and therefore DBMSs can skip the whole WHERE clause. But some of them don't. We tested this by removing the WHERE clause and got a gain:
SELECT * FROM Table1 GAIN: 5/8
It is, of course, obvious that these two queries aren't equivalentthe point is merely that it should take less time to retrieve zero rows due to an always-false condition than it should to do a full table scan-provided the DBMS doesn't evaluate the always-false condition. This example shows that DBMSs don't always throw out always-false conditions and all their dependents in the PREPARE stage. But they're pretty reliable at throwing out always-true conditions. So you can use always-true conditions for an SQL equivalent of conditional compilation. For example, if you worry that a DBMS won't give high precision for division results, add a separate condition that comes into play only when necessaryas in this example:
... WHERE (77 / 10 = 7.7 AND column1 / 10 = 7.7) OR (77 / 10 = 7 AND column1 * 10 = 77) GAIN: 5/8
Because of the unreliability aspect, though, it is usually a bad idea to put in redundant code. Suppose that a column, indexed_column, is an indexed NOT NULL column. You could transform this SQL statement:
SELECT * FROM Table1
to this statement:
SELECT * FROM Table1 WHERE indexed_column > 0
This is a way of forcing the DBMS to look up via the index. Alas, it works only with a few DBMSs. In general, then, don't add redundant conditions to WHERE clauses.
Ensure You Use the Right DBMS
There are several ways to ensure that a specific DBMS (and no other) executes an expression. Here are three examples, all of which use nonstandard SQL extensions:
Example 1: ... WHERE :variable = 'Oracle' AND /* Oracle-specific code here */ Example 2: SELECT /* ! HIGH_PRIORITY */ ... /* all DBMSs except MySQL ignore this */ Example 3: ... WHERE <escape-sequence> AND /* ODBC code */
While we're on the subject, Oracle allows you to add a comment that indicates what index you want to use. It looks like this:
SELECT /*+ INDEX(Widgets Widget_index) */ column1, column2, column3 FROM Widgets WHERE column1 <> 7; GAIN: only 1/8 because it's Oracle-specific
Oracle-specific optimizations are bad ideas if they tie you to Oracle. In this case, the hint is in a comment, so other DBMSs will ignore it. That's goodit's more portable than putting hints outside comments as Microsoft and Sybase do. So it's okayuntil other DBMSs start to put executable data inside comments too. Some are already starting to do so. So right now hints are okay, but eventually they will lead to conflict and chaos.
Constant Folding
"Go north one mile, west one mile, west one more mile, then south one mile, and you will be at your starting point."
The South Pole Riddle
Anyone who has used C will know that the expression x=1+1-1-1 is folded to x=0 at compile time. So it may surprise you that many SQL DBMSs do not fold these five obvious-looking transform candidates:
... WHERE column1 + 0 ... WHERE 5 + 0.0 ... WHERE column1 IN (1, 3, 3) ... CAST(1 AS INTEGER) ... WHERE 'a' || 'b'
If you find expressions like these in old code though, our tip isLeave them alone. They are there for historical reasons, such as forcing the DBMS to ignore indexes, changing the result data type, allowing for the difference between SMALLINT and INTEGER, or evading a limit on line size. Sorrybut the obvious-looking cases are precisely the cases where you should stop and wonder whether the original programmer had some reason for the weird syntax choice. We read a Java optimization article once that sums it up nicely: "Rule #1: Understand the code." Nevertheless, we do recommend that you transform this search condition:
... WHERE a - 3 = 5
to:
... WHERE a = 8 /* a - 3 = 5 */ GAIN: 6/8
Although it's useless in simple cases, constant folding can lead to constant propagation and is therefore A Good Thing.
Case-Insensitive Searches
Microsoft's Access Jet considers the strings 'SMITH' and 'Smith' to be equal, so Access is called a case-insensitive DBMS. Oracle, on the other hand, is usually case sensitive (the engine would say 'SMITH' and 'Smith' are unequal strings). Sybase allows you to decide about case sensitivity when you install, and a true SQL Standard DBMS will allow you to switch case sensitivity at runtime. We've seen many programmers try to ensure case insensitivity by using the fold function UPPER, as in:
... WHERE UPPER(column1) = 'SMITH'
That can be a mistake if you're dealing with strings that contain anything other than strictly Latin letters. With some DBMSs, when you translate certain French or German strings to uppercase, you lose information. For example, the function:
... UPPER('rÈsumÈ')
returns RESUMEthat is, the accent marks are lost, changing the meaning of the word from "curriculum vitae" to "begin again." Because information isn't lost going the other way, it's better to use the LOWER function, like this:
... WHERE LOWER(column1) = 'rÈsumÈ'
An even better way is to eliminate the fold function entirely if that's possible, becausewe appeal to authority hereboth the Microsoft and Oracle manuals say: "Avoid functions on columns." We're sure they mean "avoid functions on columns when there's another way to get the result needed"for example, to ensure case insensitivity, the best method is to use a case-insensitive collation rather than a fold function.
A slightly faster search assumes that the data is clean and asks for the only reasonable combinations, like this:
... WHERE column1 = 'SMITH' OR column1 = 'Smith' GAIN: 8/8
which is still slow. Our tip here isTake advantage of dead code elimination so that the 'Smith' search happens only when the DBMS is case sensitive. Here's how:
... WHERE column1 = 'SMITH' OR ('SMITH' <> 'Smith' AND column1 = 'Smith') GAIN: 3/8
Sargability
"CHUFFED. adj. [1] Pleased. [2] Displeased."
The Concise Oxford Dictionary
The ideal SQL search condition has the general form:
<column> <comparison operator> <literal>
In the early days, IBM researchers named these kinds of search conditions "sargable predicates" because SARG is a contraction for Search ARGument. In later days, Microsoft and Sybase redefined "sargable" to mean "can be looked up via the index." Alaswhen the same word has two very different meanings, it's not much use as a word any more! So we titled this section "Sargability" just for fun. But although the word is dead, the idea lives on in this rule:
The left side of a search condition should be a simple column name; the right side should be an easy-to-look-up value.
To enforce this rule, all DBMSs will transpose the expression:
5 = column1
to:
column1 = 5
When there's arithmetic involved, though, only some DBMSs will transpose. For example, we tested this transform:
... WHERE column1 - 3 = -column2
transforms to:
... WHERE column1 = -column2 + 3 GAIN: 4/8
The gain shows that doing the transform ourselves helped considerably.
On a 32-bit computer, arithmetic is fastest if all operands are INTEGERs (because INTEGERs are 32-bit signed numbers) rather than SMALLINTs, DECIMALs, or FLOATs. Thus this condition:
... WHERE decimal_column * float_column
is slower than:
... WHERE integer_column * integer_column GAIN: 5/8
You might expect "GAIN: 8/8" with the last example, but some DBMSs (Oracle is an example) don't distinguish between 16-bit and 32-bit numbers or store them in binary form. (Recall that dBASE stores numbers in ASCII. When a DBMS stores numbers in ASCII, truncation will always beat division by ten. So store using the base that you might divide by.)
The Bottom Line: General Tuning
The left side of a search condition should be a simple column name; the right side should be an easy-to-look-up value.
Each component of a search condition has a point count. The higher the points, the faster the component. The condition that takes the least time gets the most points.
Put multiple expressions in the correct order.
Use the Law of Transitivity and the concept of constant propagation to substitute a literal for a column name or column expression whenever you can do so without changing the meaning of an expression.
Some DBMSs won't fold most obvious-looking (to a C programmer) expressions. Don't use this principle as the reason to always transform such expressions when you find them in old code thoughusually they're there for historical reasons. Remember Rule #1: Understand the code before changing it.
If the code involves an obvious math expression, do evaluate it and transform the condition to the evaluated result. Constant folding can lead to constant propagation and is therefore A Good Thing.
Avoid functions on columns.
If you can't avoid functions, don't use UPPER to ensure case insensitivity. Use LOWER instead.