The MySQL Query Optimizer
When you issue a query that selects rows, MySQL analyzes it to see if any optimizations can be used to process the query more quickly. In this section, we'll look at how the query optimizer works. For additional information about optimization measures that MySQL takes, consult the optimization chapter in the MySQL Reference Manual.
The MySQL query optimizer takes advantage of indexes, of course, but it also uses other information. For example, if you issue the following query, MySQL will execute it very quickly, no matter how large the table is:
SELECT * FROM tbl_name WHERE 0;
In this case, MySQL looks at the WHERE clause, realizes that no rows can possibly satisfy the query, and doesn't even bother to search the table. You can see this by issuing an EXPLAIN statement, which tells MySQL to display some information about how it would execute a SELECT query without actually executing it.1 To use EXPLAIN, just put the word EXPLAIN in front of the SELECT statement:
mysql> EXPLAIN SELECT * FROM tbl_name WHERE 0\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: NULL type: NULL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL Extra: Impossible WHERE
Normally, EXPLAIN returns more information than that, including non-NULL information about the indexes that will be used to scan tables, the types of joins that will be used, and estimates of the number of rows that will need to be examined from each table.
How the Optimizer Works
The MySQL query optimizer has several goals, but its primary aims are to use indexes whenever possible and to use the most restrictive index in order to eliminate as many rows as possible as soon as possible. That last part might sound backward and nonintuitive. After all, your goal in issuing a SELECT statement is to find rows, not to reject them. The reason the optimizer tries to reject rows is that the faster it can eliminate rows from consideration, the more quickly the rows that do match your criteria can be found. Queries can be processed more quickly if the most restrictive tests can be done first. Suppose that you have a query that tests two columns, each of which has an index on it:
SELECT col3 FROM mytable WHERE col1 = 'some value' AND col2 = 'some other value';
Suppose also that the test on col1 matches 900 rows, the test on col2 matches 300 rows, and that both tests together succeed on 30 rows. Testing col1 first results in 900 rows that must be examined to find the 30 that also match the col2 value. That's 870 failed tests. Testing col2 first results in 300 rows that must be examined to find the 30 that also match the col1 value. That's only 270 failed tests, so less computation and disk I/O is required. As a result, the optimizer will test col2 first because doing so results in less work overall.
You can help the optimizer take advantage of indexes by using the following guidelines:
Try to compare columns that have the same data type. When you use indexed columns in comparisons, use columns that are of the same type. Identical data types will give you better performance than dissimilar types. For example, INT is different from BIGINT. CHAR(10) is considered the same as CHAR(10) or VARCHAR(10) but different from CHAR(12) or VARCHAR(12). If the columns you're comparing have different types, you can use ALTER TABLE to modify one of them so that the types match.
Try to make indexed columns stand alone in comparison expressions. If you use a column in a function call or as part of a more complex term in an arithmetic expression, MySQL can't use the index because it must compute the value of the expression for every row. Sometimes this is unavoidable, but many times you can rewrite a query to get the indexed column to appear by itself.
The following WHERE clauses illustrate how this works. They are equivalent arithmetically, but quite different for optimization purposes:
WHERE mycol < 4 / 2 WHERE mycol * 2 < 4
For the first line, the optimizer simplifies the expression 4/2 to the value 2, and then uses an index on mycol to quickly find values less than 2. For the second expression, MySQL must retrieve the value of mycol for each row, multiply by 2, and then compare the result to 4. In this case, no index can be used. Each value in the column must be retrieved so that the expression on the left side of the comparison can be evaluated.
Let's consider another example. Suppose that you have an indexed column date_col. If you issue a query such as the one following, the index isn't used:
SELECT * FROM mytbl WHERE YEAR(date_col) < 1990;
The expression doesn't compare 1990 to an indexed column; it compares 1990 to a value calculated from the column, and that value must be computed for each row. As a result, the index on date_col is not used because performing the query requires a full table scan. What's the fix? Just use a literal date, and then the index on date_col can be used to find matching values in the columns:
WHERE date_col < '1990-01-01'
But suppose that you don't have a specific date. You might be interested instead in finding records that have a date that lies within a certain number of days from today. There are several ways to express a comparison of this typenot all of which are equally efficient. Here are three possibilities:
WHERE TO_DAYS(date_col) - TO_DAYS(CURDATE()) < cutoff WHERE TO_DAYS(date_col) < cutoff + TO_DAYS(CURDATE()) WHERE date_col < DATE_ADD(CURDATE(), INTERVAL cutoff DAY)
For the first line, no index is used because the column must be retrieved for each row so that the value of TO_DAYS(date_col) can be computed. The second line is better. Both cutoff and TO_DAYS(CURDATE()) are constants, so the right-hand side of the comparison can be calculated by the optimizer once before processing the query, rather than once per row. But the date_col column still appears in a function call, preventing use of the index. The third line is best of all. Again, the right-hand side of the comparison can be computed once as a constant before executing the query, but now the value is a date. That value can be compared directly to date_col values, which no longer need to be converted to days. In this case, the index can be used.
Don't use wildcards at the beginning of a LIKE pattern. Some string searches use a WHERE clause of the following form:
WHERE col_name LIKE '%string%'
That's the correct thing to do if you want to find a string no matter where it occurs in the column. But don't put '%' on both sides of the string simply out of habit. If you're really looking for the string only when it occurs at the beginning of the column, leave out the first '%'. Suppose that you're looking in a column containing last names for names like MacGregor or MacDougall that begin with 'Mac'. In that case, write the WHERE clause like this:
WHERE last_name LIKE 'Mac%'
The optimizer looks at the literal initial part of the pattern and uses the index to find rows that match as though you'd written the following expression, which is in a form that allows an index on last_name to be used:
WHERE last_name >= 'Mac' AND last_name < 'Mad'
This optimization does not apply to pattern matches that use the REGEXP operator. REGEXP expressions are never optimized.
Help the optimizer make better estimates about index effectiveness. By default, when you compare values in indexed columns to a constant, the optimizer assumes that key values are distributed evenly within the index. The optimizer will also do a quick check of the index to estimate how many entries will be used when determining whether the index should be used for constant comparisons. For MyISAM, InnoDB, and BDB tables, you can tell the server to perform an analysis of key values by using ANALYZE TABLE. This provides the optimizer with better information.
Use EXPLAIN to verify optimizer operation. The EXPLAIN statement can tell you whether indexes are being used. This information is helpful when you're trying different ways of writing a statement or checking whether adding indexes actually will make a difference in query execution efficiency. For examples, see "Using EXPLAIN to Check Optimizer Operation."
Give the optimizer hints when necessary. Normally, the MySQL optimizer considers itself free to determine the order in which to scan tables to retrieve rows most quickly. On occasion, the optimizer will make a non-optimal choice. If you find this happening, you can override the optimizer's choice using the STRAIGHT_JOIN keyword. A join performed with STRAIGHT_JOIN is like a cross join but forces the tables to be joined in the order named in the FROM clause.
STRAIGHT_JOIN can be specified at two points in a SELECT statement. You can specify it between the SELECT keyword and the selection list to have a global effect on all cross joins in the statement, or you can specify it in the FROM clause. The following two statements are equivalent:
SELECT STRAIGHT_JOIN ... FROM t1, t2, t3 ... ; SELECT ... FROM t1 STRAIGHT_JOIN t2 STRAIGHT_JOIN t3 ... ;
Run the query both with and without STRAIGHT_JOIN; MySQL might have some good reason not to use indexes in the order you think is best. (Check the execution plans with EXPLAIN to see how MySQL handles each statement.)
You can also use FORCE INDEX, USE INDEX, or IGNORE INDEX to give the server guidance about which indexes to prefer.
Take advantage of areas in which the optimizer is more mature. MySQL can do joins and subqueries, but subquery support is more recent, having been added in MySQL 4.1. Consequently, the optimizer has been better tuned for joins than for subqueries in some cases. This has a practical implication when you have a subquery that runs slowly. As discussed in "Rewriting Subqueries as Joins," in Chapter 2, some subqueries can be reformulated as logically equivalent joins. If your slow subquery is one of these, try writing it as a join to see if it performs better.
Test alternative forms of queries, but run them more than once. When testing alternative forms of a query (for example, a subquery versus an equivalent join), run it several times each way. If you run a query only once each of two different ways, you'll often find that the second query is faster just because information from the first query is still cached and need not actually be read from the disk. You should also try to run queries when the system load is relatively stable to avoid effects due to other activities on your system.
Avoid overuse of MySQL's automatic type conversion. MySQL will perform automatic type conversion, but if you can avoid conversions, you may get better performance. For example, if num_col is an integer column, each of these queries will return the same result:
SELECT * FROM mytbl WHERE num_col = 4; SELECT * FROM mytbl WHERE num_col = '4';
But the second query involves a type conversion. The conversion operation itself involves a small performance penalty for converting the integer and string to double to perform the comparison. A more serious problem is that if num_col is indexed, a comparison that involves type conversion may prevent the index from being used.
The opposite kind of comparison (comparing a string column to a numeric value) also can prevent use of an index. Suppose that you write a query like this:
SELECT * FROM mytbl WHERE str_col = 4;
In this case, an index on str_col cannot be used because there can be many different string values in str_col that are equal to 4 when converted to a number (for example, '4', '4.0', and '4th' ). The only way to know which values qualify is to read each one and perform the comparison.
Using EXPLAIN to Check Optimizer Operation
The EXPLAIN statement is useful for gaining insight into the execution plans that the optimizer generates for processing statements. In this section, I'll show two uses for EXPLAIN:
To see whether writing a query different ways affects whether an index can be used.
To see the effect that adding indexes to a table has on the optimizer's ability to generate efficient execution plans.
The discussion describes only those EXPLAIN output fields that are relevant for the examples. More information about the meaning of EXPLAIN output can be found in Appendix E, "SQL Syntax Reference."
Earlier, in "How the Optimizer Works," the point was made that the way you write an expression can determine whether the optimizer can use available indexes. Specifically, the discussion there used the example that of the three following logically equivalent WHERE clauses, only the third allows use of an index:
WHERE TO_DAYS(date_col) - TO_DAYS(CURDATE()) < cutoff WHERE TO_DAYS(date_col) < cutoff + TO_DAYS(CURDATE()) WHERE date_col < DATE_ADD(CURDATE(), INTERVAL cutoff DAY)
EXPLAIN allows you to check whether one way of writing an expression is better than another. To see this, let's try using each of the WHERE clauses to search for expiration column values in the member table, using a cutoff value of 30 days. However, as originally created, the member table has no index on the expiration column. To allow the relationship to be seen between index use and how an expression is written, first index the expiration column:
mysql> ALTER TABLE member ADD INDEX (expiration);
Then try EXPLAIN with each form of the expression to see what kind of execution plans the optimizer comes up with:
mysql> EXPLAIN SELECT * FROM MEMBER -> WHERE TO_DAYS(expiration) - TO_DAYS(CURDATE()) < 30\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: MEMBER type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 102 Extra: Using where mysql> EXPLAIN SELECT * FROM MEMBER -> WHERE TO_DAYS(expiration) < 30 + TO_DAYS(CURDATE())\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: MEMBER type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 102 Extra: Using where mysql> EXPLAIN SELECT * FROM MEMBER -> WHERE expiration < DATE_ADD(CURDATE(), INTERVAL 30 DAY)\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: MEMBER type: range possible_keys: expiration key: expiration key_len: 4 ref: NULL rows: 6 Extra: Using where
The results for the first two statements show that the index is not used. The type value indicates how values will be read from a table. ALL means "all records will be examined." That is, a full table scan will be performed, without benefit of an index. The NULL in each of the key-related columns also indicates that no index will be used.
By contrast, the result for the third statement shows that the WHERE clause has been written such that the optimizer can use the index on the expiration column:
-
The type value indicates that it can use the index to search for a specific range of values (those less than the date given on the right side of the expression).
-
The possible_keys and key values show that the index on expiration is considered a candidate index and also is the index that actually would be used.
-
The rows value shows that the optimizer estimates that it would need to examine 6 rows to process the query. That's better than the value of 102 for the first two execution plans.
A second use for EXPLAIN is to find out whether adding indexes would help the optimizer execute a statement more efficiently. For this example, I will use just two tables that initially are unindexed. This suffices to show the effect of creating indexes. The same principles apply to more complex joins that involve many tables.
Support that we have two tables t1 and t2, each with 1,000 rows containing the values 1 to 1000. The query that we'll examine looks for those rows where corresponding values from the two tables are the same:
mysql> SELECT t1.i1, t2.i2 FROM t1, t2 WHERE t1.i1 = t2.i2; +------+------+ | i1 | i2 | +------+------+ | 1 | 1 | | 2 | 2 | | 3 | 3 | | 4 | 4 | | 5 | 5 | ...
With no indexes on either of the tables, EXPLAIN produces this result:
mysql> EXPLAIN SELECT t1.i1, t2.i2 FROM t1, t2 WHERE t1.i1 = t2.i2\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 1000 Extra: *************************** 2. row *************************** id: 1 select_type: SIMPLE table: t2 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 1000 Extra: Using where
Here, ALL in the type column indicates a full table scan that examines all rows. NULL in the possible_keys column indicates that no candidate indexes were found for speeding up the query. (The key, key_len, and ref columns all are NULL as well due to the lack of a suitable index.) Using where indicates that information in the WHERE clause is used to identify qualifying rows.
These pieces of information tell us that the optimizer finds no useful information for executing the query more efficiently:
-
It will perform a full scan of t1.
-
For each row from t1, it will perform a full scan of t2, using the information in the WHERE clause to identify qualifying rows.
The rows values show the optimizer's estimates about how many rows it will need to examine at each stage of the query. The estimate is 1000 for t1 because a full scan will be done. Similarly, the estimate is 1000 for t2, but this is for each row in t1. In other words, the number of row combinations that the optimizer estimates it will examine to process the query is 1,000 x 1,000, or one million. That is highly wasteful of effort, because only 1,000 combinations actually satisfy the conditions in the WHERE clause.
To make this query more efficient, add an index on one of the joined columns and try the EXPLAIN statement again:
mysql> ALTER TABLE t2 ADD INDEX (i2); mysql> EXPLAIN SELECT t1.i1, t2.i2 FROM t1, t2 WHERE t1.i1 = t2.i2\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 1000 Extra: *************************** 2. row *************************** id: 1 select_type: SIMPLE table: t2 type: ref possible_keys: i2 key: i2 key_len: 5 ref: sampdb.t1.i1 rows: 10 Extra: Using where; Using index
This is an improvement. The output for t1 is unchanged (indicating that a full scan still will be done on the table), but the optimizer can process t2 differently:
-
type has changed from ALL to ref, meaning that a reference value (the value from t1) can be used to perform an index lookup to locate qualifying rows in t2.
-
The reference value is given in the ref field: sampdb.t1.i1.
-
The rows value has dropped from 1000 to 10, which shows that the optimizer believes that it will need to examine only 10 rows in t2 for each row in t1. (That is a somewhat pessimistic estimate. In fact, only one row in t2 will match each row from t1. We'll see a bit later how to help the optimizer improve this estimate.) The total estimated number of row combinations is 1,000 x 10 = 10,000. That's much better than the previous estimate of one million in the absence of any indexing.
Is there any value in indexing t1? After all, for this particular join, it's necessary to scan one of the tables, and no index is needed to do that. To see if there's any effect, index t1.i1 and run EXPLAIN again:
mysql> ALTER TABLE t1 ADD INDEX (i1); mysql> EXPLAIN SELECT t1.i1, t2.i2 FROM t1, t2 WHERE t1.i1 = t2.i2\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 type: index possible_keys: i1 key: i1 key_len: 5 ref: NULL rows: 1000 Extra: Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: t2 type: ref possible_keys: i2 key: i2 key_len: 5 ref: sampdb.t1.i1 rows: 10 Extra: Using where; Using index
This output is similar to that for the previous EXPLAIN, but adding the index did make some difference in the output for t1. type has changed from NULL to index and Extra has changed from blank to Using index. These changes indicate that, although a full scan of the indexed values still would be done, the optimizer now can read them directly from the index without touching the data file at all. You will see this kind of result for a MyISAM table when the optimizer knows that it can get all the information it needs by consulting only the index file. You'll also see it for InnoDB and BDB tables when the optimizer can use information solely from the index without another seek to get the data row.
One further step we can take to help the optimizer make better cost estimates is to run ANALYZE TABLE. This causes the server to generate statistics about the distribution of key values. Analyzing the tables and running EXPLAIN again yields a better rows estimate:
mysql> ANALYZE TABLE t1, t2; mysql> EXPLAIN SELECT t1.i1, t2.i2 FROM t1, t2 WHERE t1.i1 = t2.i2\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 type: index possible_keys: i1 key: i1 key_len: 5 ref: NULL rows: 1000 Extra: Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: t2 type: ref possible_keys: i2 key: i2 key_len: 5 ref: sampdb.t1.i1 rows: 1 Extra: Using where; Using index
In this case, the optimizer now estimates that each value from t1 will match only one row in t2.
Overriding Optimization
It sounds odd, but there may be times when you'll want to defeat MySQL's optimization behavior:
To override the optimizer's table join order. Use STRAIGHT_JOIN to force the optimizer to use tables in a particular order. If you do this, you should order the tables so that the first table is the one from which the smallest number of rows will be chosen. If you are not sure which table this is, put the table with the greatest number of rows first. In other words, try to order the tables to cause the most restrictive selection to come first. Queries perform better the earlier you can narrow the possible candidate rows. Make sure to try the query with and without STRAIGHT_JOIN. There might be some reason the optimizer isn't joining tables the way you think it should, and STRAIGHT_JOIN may not actually help.
Another possibility is to use the FORCE INDEX, USE INDEX, and IGNORE INDEX modifiers after a table name in the table list of a join to tell MySQL how to use indexes. This may be helpful in cases where the optimizer doesn't make the correct choice.
To empty a table with minimal side effects. When you need to empty a MyISAM table completely, it's fastest to have the server just drop the table and re-create it based on the description stored in its .frm file. To do this, use a TRUNCATE TABLE statement:
TRUNCATE TABLE tbl_name;
The server's optimization of emptying a MyISAM table by re-creating it from scratch makes the operation extremely fast, because each row need not be deleted individually.
However, there are some side effects of TRUNCATE TABLE that may be undesirable under certain circumstances:
-
TRUNCATE TABLE does not necessarily produce an accurate count of the number of rows deleted. If you need this count, use a DELETE statement with no WHERE clause:
DELETE FROM tbl_name;
For MyISAM tables, AUTO_INCREMENT values normally are not reused when rows are deleted. (See Chapter 3, "Working with Data in MySQL.") However, emptying a table by re-creating it may reset the sequence to begin over at 1. To avoid this, use an "unoptimized" full-table DELETE statement that includes a trivially true WHERE clause:
DELETE FROM tbl_name WHERE 1;
Adding the WHERE clause forces MySQL to do a row-by-row deletion, because it must evaluate the condition for each row to determine whether to delete it. The statement executes much more slowly, but it preserves the current AUTO_INCREMENT sequence number.