Adding Indexes to a Table
Most of the tables that you have created so far have no indexes. An index serves two purposes. First, an index can be used to guarantee uniqueness. Second, an index provides quick access to data (in certain circumstances).
Here is the definition of the customers table that you created in Chapter 1:
CREATE TABLE customers ( customer_id INTEGER UNIQUE, customer_name VARCHAR(50), phone CHAR(8), birth_date DATE, balance DECIMAL(7,2) );
When you create this table, PostgreSQL will display a rather terse message:
NOTICE: CREATE TABLE / UNIQUE will create implicit index ‘customers_customer_id_key’ for table ‘customers’
What PostgreSQL is trying to tell you here is that even though you didn’t explicitly ask for one, an index has been created on your behalf. The implicit index is created so that PostgreSQL has a quick way to ensure that the values that you enter into the customer_id column are unique.
Think about how you might design an algorithm to check for duplicate values in the following list of names:
Grumby, Jonas
Hinkley, Roy
Wentworth, Eunice
Floyd, Heywood
Bowman, David
Dutton, Charles
Poole, Frank
Morbius, Edward
Farman, Jerry
Stone, Jeremy
Dutton, Charles
- Manchek, Arthur
A first attempt might simply start with the first value and look for a duplicate later in the list, comparing Grumby, Jonas to Hinkley, Roy, then Wentworth, Eunice, and so on. Next, you would move to the second name in the list and compare Hinkley, Roy to Wentworth, Eunice, then Floyd, Heywood, and so on. This algorithm would certainly work, but it would turn out to be slow as the list grew longer. Each time you add a new name to the list, you have to compare it to every other name already in the list.
A better solution would be to first sort the list:
- Bowman, David
Dutton, Charles
Dutton, Charles
Farman, Jerry
Floyd, Heywood
Grumby, Jonas
Hinkley, Roy
Manchek, Arthur
Morbius, Edward
Poole, Frank
Stone, Jeremy
- Wentworth, Eunice
After the list is sorted, it’s easy to check for duplicates—any duplicate values appear next to each other. To check the sorted list, you start with the first name, Bowman, David and compare it to the second name, Dutton, Charles. If the second name is not a duplicate of the first, you know that you won’t find any duplicates later in the list. Now when you move to the second name on the list, you compare it to the third name—now you can see that there is a duplicate. Duplicate values appear next to each other after the list is sorted. Now when you add a new name to the list, you can stop searching for duplicate values as soon as you encounter a value that sorts after the name you are adding.
An index is similar in concept to a sorted list, but it’s even better. An index provides a quick way for PostgreSQL to find data within a range of values. Let’s see how an index can help narrow a search. First, let’s assign a number to each of the names in the sorted list, just for easy reference (I’ve removed the duplicate value):
- 1 Bowman, David
2 Dutton, Charles
3 Farman, Jerry
4 Floyd, Heywood
5 Grumby, Jonas
6 Hinkley, Roy
7 Manchek, Arthur
8 Morbius, Edward
9 Poole, Frank
10 Stone, Jeremy
11 Wentworth, Eunice
Now let’s build a (simplistic) index (see Figure 3.2). The English alphabet contains 26 letters—split this roughly in half and choose to keep track of where the "Ms" start in the list. In this list, names beginning with an M start at entry number 7. Keep track of this pair (M,7) and call it the root of your index.
Figure 3.2 One-level index.
Now when you insert a new name, Tyrell, Eldon, you start by comparing it to the root. The root of the index tells you that names starting with the letter M are found starting at entry number 7. Because the list is sorted, and you know that Tyrell will sort after M, you can start searching for the insertion point at entry 7, skipping entries 1 through 6. Also, you can stop searching as soon as you encounter a name that sorts later than Tyrell.
As your list of names grows, it would be advantageous to add more levels to the index (see Figure 3.3). The letter M splits the alphabet (roughly) in half. Add a second level to the index by splitting the range between A and M (giving you G), and splitting the range between M and Z (giving you T).
Figure 3.3 Two-level index.
Now when you want to add Tyrell, Eldon to the list, you compare Tyrell against the root and find that Tyrell sorts later than M. Moving to the next layer of the index, you find that Tyrell sorts later than T, so you can jump straight to slot number 11 and insert the new value.
You can see that you can add as many index levels as you need. Each level divides the parent’s range in half, and each level reduces the number of names that you have to search to find an insertion point6.
Using an index is similar in concept to the way you look up words in a dictionary. If you have a dictionary handy, pull it off the shelf and take a close look at it. If it’s like my dictionary, it has those little thumb-tab indentations, one for each letter of the alphabet. If I want to find the definition of the word "polyglot," I’ll find the thumb-tab labeled "P" and start searching about halfway through that section. I know, because the dictionary is sorted, that "polyglot" won’t appear in any section prior to "P" and it won’t appear in any section following "P." That little thumb-tab saves a lot of searching.
You also can use an index as a quick way to check for uniqueness. If you are inserting a new name into the index structure shown earlier, you simply search for the new name in the index. If you find it in the index, it is obviously a duplicate.
I mentioned earlier that PostgreSQL uses an index for two purposes. You’ve seen that an index can be used to search for unique values. But how does PostgreSQL use an index to provide faster data access?
Let’s look at a simple query:
SELECT * FROM characters WHERE name >= ‘Grumby’ AND name < ‘Moon’;
Now assume that the list of names that you worked with before is actually a table named characters and you have an index defined for the name column, as in Figure 3.4.
Figure 3.4 Two-level index (again).
When PostgreSQL parses through the SELECT statement, it notices that you are constraining the result set to a range of names and that you have an index on the name column. That’s a convenient combination. To satisfy this statement, PostgreSQL can use the index to start searching at entry number 5. Because the rows are already sorted, PostgreSQL can stop searching as soon as it finds the first entry greater than "Moon" (that is, the search ends as soon as you hit entry number 8). This kind of operation is called a partial index scan.
Think of how PostgreSQL would process this query if the rows were not indexed. It would have to start at the beginning of the table and compare each row against the constraints; PostgreSQL can’t terminate the search without processing every row in the table. This kind of operation is called a full table scan, or table scan.
Because this kind of index can access data in sorted order, PostgreSQL can use such an index to avoid a sort that would otherwise be required to satisfy an ORDER BY clause.
In these examples, we are working with small tables, so the performance difference between a full table scan and an indexed range read is negligible. As tables become larger, the performance difference can be huge. Chapter 4, "Performance," discusses how the PostgreSQL query optimizer chooses when it is appropriate to use an index.
PostgreSQL actually supports several kinds of indexes. The previous examples show how a B-Tree index works7. Another type of index is the Hash index. A Hash index uses a technique called hashing to evenly distribute keys among a number of hash buckets. Each key value added to a hash index is run through a hashing function. The result of a hashing function is a bucket number. A simplistic hashing function for string values might sum the ASCII value of each character in the string and then compute the sum modulo the number of buckets to get the result. In C, you might write this function as
int hash_string( char * key, int bucket_count ) { int hash = 0; int i; for( i = 0; i < strlen( key ); i++ ) hash = hash + key[i]; return( hash % bucket_count ); }
Let’s run each of the names in the characters table through this function to see what kind of numbers you get back (I’ve used a bucket_count of 5):
hash_string() Value |
Name |
1 |
Grumby, Jonas |
2 |
Hinkley, Roy |
3 |
Wentworth, Eunice |
4 |
Floyd, Heywood |
4 |
Bowman, David |
3 |
Dutton, Charles |
3 |
Poole, Frank |
0 |
Morbius, Edward |
0 |
Farman, Jerry |
0 |
Stone, Jeremy |
4 |
Manchek, Arthur |
The numbers returned don’t really have any intrinsic meaning, they simply serve to distribute a set of keys amongst a set of buckets.
Now let’s reformat this table so that the contents are grouped by bucket number:
Bucket Number |
Bucket Contents |
0 |
Morbius, Edward Farman, Jerry Stone, Jeremy |
1 |
Grumby, Jonas |
2 |
Hinkley, Roy |
3 |
Wentworth, Eunice Dutton, Charles Poole, Frank |
4 |
Floyd, Heywood Bowman, David Manchek, Arthur |
You can see that the hash function (hash_string()) did a respectable job of distributing the names between the five hash buckets. Notice that we did not have to assign a unique hash value to each key—hash keys are seldom unique. The important feature of a good hash function is that it distributes a set of keys fairly evenly. Now that you have a Hash index, how can you use it? First, let’s try to insert a new name: Lowell, Freeman. The first thing you do is run this name through your hash_string() function, giving you a hash value of 4. Now you know that if Lowell, Freeman is already in the index, it will be in bucket number 4; all you have to do is search that one bucket for the name you are trying to insert.
There are a couple of important points to note about Hash indexes.
First, you may have noticed that each bucket can hold many keys. Another way to say this is that each key does not have a unique hash value. If you have too many collisions (that is, too many keys hashing to the same bucket), performance will suffer. A good hash function distributes keys evenly between all hash buckets.
Second, notice that a hash table is not sorted. The name Floyd, Heywood hashes to bucket 4, but Farman, Jerry hashes to bucket 0. Consider the SELECT statement that we looked at earlier:
SELECT * FROM characters WHERE name >= ‘Grumby’ AND name < ‘Moon’;
To satisfy this query using a Hash index, you have to read the entire contents of each bucket. Bucket 0 contains one row that meets the constraints (Farman, Jerry), bucket 2 contains one row, and bucket 4 contains one row. A Hash index offers no advantage to a range read. A Hash index is good for searches based on equality. For example, the SELECT statement
SELECT * FROM characters WHERE name = ‘Grumby, Jonas’;
can be satisfied simply by hashing the string that you are searching for. A Hash index is also useful when you are joining two tables where the join constraint is of the form table1-column = table2-column8. A Hash read cannot be used to avoid a sort required to satisfy an ORDER BY clause.
PostgreSQL supports two other types of index structures: the R-Tree index and the GiST index. An R-Tree index is best suited for indexing spatial (that is, geometric or geographic) data. A GiST index is a B-Tree index that can be extended by defining new query predicates9. More information about GiST indexes can be found at http://gist.cs.berkeley.edu/.
Tradeoffs
The previous section showed that PostgreSQL can use an index to speed the process of searching for data within a range of values (or data with an exact value). Most queries (that is, SELECT commands) in PostgreSQL include a WHERE clause to limit the result set. If you find that you are often searching for results based on a range of values for a specific column or group of columns, you might want to consider creating an index that covers those columns.
However, you should be aware that an index represents a performance tradeoff. When you create an index, you are trading read performance for write performance. An index can significantly reduce the amount of time it takes to retrieve data, but it will also increase the amount of time it takes to INSERT, DELETE, and UPDATE data. Maintaining an index introduces substantial overhead when you modify the data within a table.
You should consider this tradeoff when you feel the need to add a new index to a table. Adding an index to a table that is updated frequently will certainly slow the updates. A good candidate for an index is a table that you SELECT from frequently but seldom update. A customer list, for example, doesn’t change often (possibly several times each day), but you probably query the customer list frequently. If you find that you often query the customer list by phone number, it would be beneficial to index the phone number column. On the other hand, a table that is updated frequently, but seldom queried, such as a transaction history table, would be a poor choice for an index.
Creating an Index
Now that you have seen what an index can do, let’s look at the process of adding an index to a table. The process of creating a new index can range from simple to somewhat complex.
Let’s add an index to the rentals table. Here is the structure of the rentals table for reference:
CREATE TABLE rentals ( tape_id CHARACTER(8) REFERENCES tapes, customer_id INTEGER REFERENCES customers, rental_date DATE );
The syntax for a simple CREATE INDEX command is
CREATE [UNIQUE] INDEX index-name ON table-name( column [,...] );
You want to index the rental_date column in the rentals table:
CREATE INDEX rentals_rental_date ON rentals ( rental_date );
You haven’t specified any optional information in this command (I’ll get to the options in a moment), so PostgreSQL creates a B-Tree index named rentals_rental_date. PostgreSQL considers using this whenever it finds a WHERE clause that refers to the rental_date column using the <, <=, =, >=, or > operator. This index also can be used when you specify an ORDER BY clause that sorts on the rental_date column.
The index-name must be unique within the database: You can’t have two indexes with the same name, even if they are defined on different tables. New rows are indexed as they are added, and deleted rows are removed. If you change the rental_date for a given row, the index will be updated automatically. If you have any data in the rentals table, each row will be included in the index.
If you don’t specify an index type when creating an index, you’ll get a B-Tree index. Let’s change the rentals_rental_date index into a Hash index. First, drop the original index:
DROP INDEX rentals_rental_date;
Then you can create a new index:
CREATE INDEX rentals_rental_date ON rentals USING HASH ( rental_date );
The only difference between this CREATE INDEX command and the previous one is that I have included a USING clause. You can specify USING BTREE (which is the default), USING HASH, USING RTREE, or USING GIST.
This index cannot be used to satisfy an ORDER BY clause. In fact, this index can be used only when rental_date is compared using the = operator.
I dropped the B-Tree index before creating the Hash index, but that is not strictly necessary. It is perfectly valid (but unusual) to have two or more indexes that cover the same column, as long as the indexes are uniquely named. If we had both a B-Tree index and a Hash index covering the rental_date column, PostgreSQL could use the Hash index for = comparisons and the B-Tree index for other comparisons.
Functional Indexes and Partial Indexes
Now let’s look at two variations on the basic index types: functional indexes and partial indexes.
A column-based index catalogs the values found in a column (or a set of columns). A functional index (or more precisely a function-valued index) catalogs the values returned by a given function. This might be easiest to understand by looking at an example. Each row in the customers table contains a phone number. You can use the exchange10 portion of the phone number to determine whether a given customer is located close to your store. For example, you may know that the 555, 556, and 794 exchanges are within five miles of your virtual video store. Let’s create a function that extracts the exchange from a phone number:
-- exchange_index.sql -- CREATE OR REPLACE FUNCTION get_exchange( CHARACTER ) RETURNS CHARACTER AS ‘ DECLARE Result CHARACTER(3); BEGIN result := SUBSTR( $1, 1, 3 ); return( result ); END; ‘ LANGUAGE ‘plpgsql’ WITH ( ISCACHABLE );
Don’t be too concerned if this looks a bit confusing; I’ll cover the PL/pgSQL language in more detail in Chapter 7, "PL/pgSQL." This function (get_exchange()) accepts a single argument, presumably a phone number, and extracts the first three characters. You can call this function directly from psql:
movies=# SELECT customer_name, phone, get_exchange( phone ) movies-# FROM customers; customer_name | phone | get_exchange ----------------------+----------+------------ Jones, Henry | 555-1212 | 555 Rubin, William | 555-2211 | 555 Panky, Henry | 555-1221 | 555 Wonderland, Alice N. | 555-1122 | 555 Wink Wankel | 555-1000 | 555
You can see that given a phone number, get_exchange() returns the first three digits. Now let’s create a function-valued index that uses this function:
CREATE INDEX customer_exchange ON customers ( get_exchange( phone ));
When you insert a new row into a column-based index, PostgreSQL will index the values in the columns covered by that index. When you insert a new row into a function-valued index, PostgreSQL will call the function that you specified and then index the return value.
After the customer_exchange index exists, PostgreSQL can use it to speed up queries such as
SELECT * FROM customers WHERE get_exchange( phone ) = ‘555’; SELECT * FROM customers ORDER BY get_exchange( phone );
Now you have an index that you can use to search the customer list for all customers that are geographically close. Let’s pretend that you occasionally want to send advertising flyers to those customers closest to you: you might never use the customer_exchange index for any other purpose. If you need the customer_exchange index for only a small set of customers, why bother maintaining that index for customers outside of your vicinity? This is where a partial index comes in handy. When you create an index, you can include a WHERE clause in the CREATE INDEX command. Each time you insert (or update) a row, the WHERE clause is evaluated. If a row satisfies the constraints of the WHERE clause, that row is included in the index; otherwise, the row is not included in the index. Let’s DROP the customer_exchange index and replace it with a partial, function-valued index:
movies=# DROP INDEX customer_exchange; DROP movies=# CREATE INDEX customer_exchange movies-# ON customers ( get_exchange( phone )) movies-# WHERE movies-# get_exchange( phone ) = ‘555’ movies-# OR movies-# get_exchange( phone ) = ‘556’ movies-# OR movies-# get_exchange( phone ) = ‘794’; CREATE
Now the customer_exchange partial index contains entries only for customers in the 555, 556, or 794 exchange.
There are three performance advantages to a partial index:
A partial index requires less disk space than a full index.
Because fewer rows are cataloged in a partial index, the cost of maintaining the index is lower.
When a partial index is used in a query, PostgreSQL will have fewer index entries to search.
Partial indexes and function-valued indexes are variations on the four basic index types. You can create a function-valued Hash index, B-Tree index, R-tree index, or GiST index. You can also create a partial variant of any index type. And, as you have seen, you can create partial function-valued indexes (of any type). A function-valued index doesn’t change the organization of an index—just the values that are actually included in the index. The same is true for a partial index.
Creating Indexes on Array Values
Most indexes cover scalar-valued columns (columns that store a single value). PostgreSQL also allows you to define indexes that cover index values. In fact, you can create an index that covers the entire array or (starting with PostgreSQL version 7.4) an index that covers individual elements within an array. In Chapter 2 we showed you a modified version of the customers table that included an array column (monthly_balances). You can add this column to your working copy of the customers table with the following command:
movies=# ALTER TABLE customers movies-# ADD COLUMN movies-# monthly_balances DECIMAL( 7, 2 )[ 12 ]; ALTER TABLE
To create an index that covers a single element of monthly_balances array (say, the element corresponding to the month of February), you could execute the following command:
movies=# CREATE INDEX customers_feb movies-# ON customers(( monthly_balances[2] )); CREATE INDEX
Notice that you need an extra set of parentheses around monthly_balances[2]. Once you’ve created the customers_feb index, PostgreSQL can use it to satisfy queries such as
movies=# SELECT * FROM customers WHERE monthly_balances[2] = 10; movies=# SELECT * FROM customers ORDER BY monthly_balances[2];
To create an index that covers the entire monthly_balances array, execute the command
movies=# CREATE INDEX customers_by_monthly_balance movies-# ON customers( monthly_balances ); CREATE INDEX
When you create an index that covers an array column, the syntax is the same as you would use to cover a scalar (single-valued) column. The PostgreSQL optimizer can use the customers_by_monthly_balance index to satisfy an ORDER BY clause such as
movies=# SELECT * FROM customers ORDER BY monthly_balances;
However, you may be surprised to find that the optimizer will not use customers_by_monthly_balance to satisfy a WHERE CLAUSE such as
movies=# SELECT * FROM customers WHERE monthly_balances[1] = 10;
The PostgreSQL optimizer will use the customers_by_monthly_balance index to satisfy a WHERE_CLAUSE that compares the entire monthly_balances array against another array, like this:
movies=# SELECT * FROM customers WHERE monthly_balances = ‘{10}’;
But be aware that these queries are not equivalent. The first WHERE clause (monthly_balances[1] = 10) selects any row where monthly_balances[1] is equal to 10, regardless of the other monthly_balances in that row. The second WHERE clause (monthly_balances = ‘{10}’) selects only those rows where monthly_balances[1] = 10 and all other monthly_balances values are NULL.
Indexes and Tablespaces
When you create an index, you can tell PostgreSQL to store the index in a specific tablespace by including a TABLESPACE tablespacename clause, like this:
CREATE INDEX rentals_rental_date ON rentals ( rental_date ) TABLESPACE mytablespace;
If you don’t specify a tablespace, PostgreSQL creates the index in the tablespace assigned to the table that you are indexing. You can move an existing index to a different tablespace using the ALTER INDEX command. For example, to move the rentals_rental_date index to mytablespace, you would execute the command
ALTER INDEX rentals_rental_date SET TABLESPACE mytablespace;
You may want to store a table and its indexes in different tablespaces in order to spread the workload among multiple physical disk drives.