Transaction Processing
Now let’s move on to an important feature in any database system: transaction processing.
A transaction is a group of one or more SQL commands treated as a unit. PostgreSQL promises that all commands within a transaction will complete or that none of them will complete. If any command within a transaction does not complete, PostgreSQL will roll back all changes made within the transaction.
PostgreSQL makes use of transactions to ensure database consistency. Transactions are needed to coordinate updates made by two or more concurrent users. Changes made by a transaction are not visible to other users until the transaction is committed. When you commit a transaction, you are telling PostgreSQL that all the changes made within the transaction are logically complete, the changes should be made permanent, and the changes should be exposed to other users. When you roll back a transaction, you are telling PostgreSQL that the changes made within the transaction should be discarded and not made visible to other users.
To start a new transaction, execute a BEGIN command. To complete the transaction and have PostgreSQL make your changes permanent, execute the COMMIT command. If you want PostgreSQL to revert all changes made within the current transaction, execute the ROLLBACK command11.
It’s important to realize that all SQL commands execute within a transaction. If you don’t explicitly BEGIN a transaction, PostgreSQL will automatically execute each command within its own transaction.
Persistence
I used to think that single-command transactions were pretty useless: I was wrong. Single-command transactions are important because a single command can access multiple rows. Consider the following: Let’s add a new constraint to the customers table.
movies=# ALTER TABLE customers ADD CONSTRAINT movies-# balance_exceeded CHECK( balance <= 50 );
This constraint ensures that no customer is allowed to have a balance exceeding $50.00. Just to prove that it works, let’s try setting a customer’s balance to some value greater than $50.00:
movies=# UPDATE CUSTOMERS SET balance = 100 where customer_id = 1; ERROR: ExecReplace: rejected due to CHECK constraint balance_exceeded
You can see that the UPDATE is rejected. What happens if you try to update more than one row? First, let’s look at the data already in the customers table:
movies=# SELECT * FROM customers; customer_id | customer_name | phone | birth_date | balance -------------+----------------------+----------+------------+--------- 1 | Jones, Henry | 555-1212 | 1970-10-10 | 0.00 2 | Rubin, William | 555-2211 | 1972-07-10 | 15.00 3 | Panky, Henry | 555-1221 | 1968-01-21 | 0.00 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 3.00 8 | Wink Wankel | 555-1000 | 1988-12-25 | 0.00 (5 rows)
Now, try to UPDATE every row in this table:
movies=# UPDATE customers SET balance = balance + 40; ERROR: ExecReplace: rejected due to CHECK constraint balance_exceeded
This UPDATE command is rejected because adding $40.00 to the balance for Rubin, William violates the balance_exceeded constraint. The question is, were any of the customers updated before the error occurred? The answer is: probably. You don’t really know for sure because any changes made before the error occurred are rolled back. The net effect is that no changes were made to the database:
movies=# SELECT * FROM customers; customer_id | customer_name | phone | birth_date | balance -------------+----------------------+----------+------------+--------- 1 | Jones, Henry | 555-1212 | 1970-10-10 | 0.00 2 | Rubin, William | 555-2211 | 1972-07-10 | 15.00 3 | Panky, Henry | 555-1221 | 1968-01-21 | 0.00 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 3.00 8 | Wink Wankel | 555-1000 | 1988-12-25 | 0.00 (5 rows)
If some of the changes persisted while others did not, you would have to somehow find the persistent changes yourself and revert them. You can see that single-command transactions are far from useless. It took me awhile to learn that lesson.
What about multicommand transactions? PostgreSQL treats a multicommand transaction in much the same way that it treats a single-command transaction. A transaction is atomic, meaning that all the commands within the transaction are treated as a single unit. If any of the commands fail to complete, PostgreSQL reverts the changes made by other commands within the transaction.
Transaction Isolation
I mentioned earlier in this section that the changes made within a transaction are not visible to other users until the transaction is committed. To be a bit more precise, uncommitted changes made in one transaction are not visible to other transactions12.
Transaction isolation helps to ensure consistent data within a database. Let’s look at a few of the problems solved by transaction isolation.
Consider the following transactions:
User: bruce |
Time |
User: sheila |
BEGIN TRANSACTION |
T1 |
BEGIN TRANSACTION |
UPDATE customers |
T2 |
|
SET balance = balance - 3 |
|
|
WHERE customer_id = 2; |
|
|
|
T3 |
SELECT SUM( balance ) |
|
|
FROM customers; |
|
T4 |
COMMIT TRANSACTION; |
ROLLBACK TRANSACTION; |
T5 |
|
At time T1, bruce and sheila each begin a new transaction. bruce updates the balance for customer 3 at time T1. At time T3, sheila computes the SUM() of the balances for all customers, completing her transaction at time T4. At time T5, bruce rolls back his transaction, discarding all changes within his transaction. If these transactions were not isolated from each other, sheila would have an incorrect answer: Her answer was calculated using data that was rolled back.
This problem is known as the dirty read problem: without transaction isolation, sheila would read uncommitted data. The solution to this problem is known as READ COMMITTED. READ COMMITTED is one of the two transaction isolation levels supported by PostgreSQL. A transaction running at the READ COMMITTED isolation level is not allowed to read uncommitted data. I’ll show you how to change transaction levels in a moment.
There are other data consistency problems that are avoided by isolating transactions from each other. In the following scenario, sheila will receive two different answers within the same transaction:
User: bruce |
Time |
User: sheila |
BEGIN TRANSACTION; |
T1 |
BEGIN TRANSACTION; |
|
T2 |
SELECT balance |
FROM customers |
|
|
WHERE customer_id = 2; |
|
|
UPDATE customers |
|
|
SET balance = 20 |
|
|
WHERE customer_id = 2; |
T3 |
|
COMMIT TRANSACTION; |
T4 |
|
|
T5 |
SELECT balance |
|
|
FROM customers |
|
|
WHERE customer_id = 2; |
|
T6 |
COMMIT TRANSACTION; |
Again, bruce and sheila each start a transaction at time T1. At T2, sheila finds that customer 2 has a balance of $15.00. bruce changes the balance for customer 2 from $15.00 to $20.00 at time T3 and commits his change at time T4. At time T5, sheila executes the same query that she executed earlier in the transaction, but this time she finds that the balance is $20.00. In some applications, this isn’t a problem; in others, this interference between the two transactions is unacceptable. This problem is known as the non-repeatable read.
Here is another type of problem:
User: bruce |
Time |
User: sheila |
BEGIN TRANSACTION; |
T1 |
BEGIN TRANSACTION; |
|
T2 |
SELECT * FROM customers; |
INSERT INTO customers VALUES |
T3 |
|
( |
|
|
6, |
|
|
‘Neville, Robert’, |
|
|
‘555-9999’, |
|
|
‘1971-03-20’, |
|
|
0.00 |
|
|
); |
|
|
COMMIT TRANSACTION; |
T4 |
|
|
T5 |
SELECT * FROM customers; |
|
T6 |
COMMIT TRANSACTION; |
In this example, sheila again executes the same query twice within a single transaction. This time, bruce has inserted a new row in between the sheila’s queries. Notice that this is not a case of a dirty read—bruce has committed his change before sheila executes her second query. At time T5, sheila finds a new row. This is similar to the non-repeatable read, but this problem is known as the phantom read problem.
The answer to both the non-repeatable read and the phantom read is the SERIALIZABLE transaction isolation level. A transaction running at the SERIALIZABLE isolation level is only allowed to see data committed before the transaction began.
In PostgreSQL, transactions usually run at the READ COMMITTED isolation level. If you need to avoid the problems present in READ COMMITTED, you can change isolation levels using the SET TRANSACTION command. The syntax for the SET TRANSACTION command is
SET TRANSACTION ISOLATION LEVEL { READ COMMITTED | SERIALIZABLE };
The SET TRANSACTION command affects only the current transaction (and it must be executed before the first DML13 command within the transaction). If you want to change the isolation level for your session (that is, change the isolation level for future transactions), you can use the SET SESSION command:
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL { READ COMMITTED | SERIALIZABLE }
PostgreSQL version 8.0 introduces a new transaction processing feature called a SAVEPOINT. A SAVEPOINT is a named marker that you define within the stream of commands that make up a transaction. Once you’ve defined a SAVEPOINT, you can ROLLBACK any changes that you’ve made since that point without discarding changes made prior to the SAVEPOINT—in other words, you can ROLLBACK part of a transaction (the trailing part) without rolling back the entire transaction. To create a SAVEPOINT, execute a SAVEPOINT command within a transaction. The syntax for a SAVEPOINT command is very simple:
SAVEPOINT savepoint-name
The savepoint-name must follow the normal rules for an identifier; it must be unique within the first 64 characters and must start with a letter or underscore (or it must be a quoted identifier). A SAVEPOINT gives a name to a point in time; in particular, a point between two SQL commands. Consider the following sequence:
movies=# SELECT customer_id, customer_name FROM customers; customer_id | customer_name -------------+---------------------- 3 | Panky, Henry 1 | Jones, Henry 4 | Wonderland, Alice N. 2 | Rubin, William (4 rows) movies=# START TRANSACTION; START TRANSACTION movies=# INSERT INTO customers VALUES( 5, ‘Kemp, Hans’ ); INSERT 44272 1 movies=# SELECT * FROM customers; customer_id | customer_name | phone | birth_date | balance -------------+----------------------+----------+------------+--------- 3 | Panky, Henry | 555-1221 | 1968-01-21 | 0.00 1 | Jones, Henry | 555-1212 | 1970-10-10 | 0.00 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 3.00 2 | Rubin, William | 555-2211 | 1972-07-10 | 15.00 5 | Kemp, Hans | | | (5 rows)
At this point, you’ve started a new transaction and inserted a new row, but you haven’t committed your changes yet. Now define a SAVEPOINT named p1 and insert a second row:
movies=# SAVEPOINT P1; SAVEPOINT movies=# INSERT INTO customers VALUES( 6, ‘Falkstein, Gerhard’ ); INSERT 44273 1
The SAVEPOINT command inserted a marker into the transaction stream. If you execute a ROLLBACK command at this point, both of the newly inserted rows will be discarded (in other words, all of the changes you’ve made in this transaction will be rolled back):
movies=# ROLLBACK; ROLLBACK movies=# SELECT * FROM customers; customer_id | customer_name | phone | birth_date | balance -------------+----------------------+----------+------------+--------- 3 | Panky, Henry | 555-1221 | 1968-01-21 | 0.00 1 | Jones, Henry | 555-1212 | 1970-10-10 | 0.00 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 3.00 2 | Rubin, William | 555-2211 | 1972-07-10 | 15.00 (4 rows)
Now repeat the same sequence of commands, but this time around, execute a qualified ROLLBACK command, like this:
movies=# ROLLBACK TO SAVEPOINT P1; ROLLBACK movies=# SELECT * FROM customers; customer_id | customer_name | phone | birth_date | balance -------------+----------------------+----------+------------+--------- 3 | Panky, Henry | 555-1221 | 1968-01-21 | 0.00 1 | Jones, Henry | 555-1212 | 1970-10-10 | 0.00 4 | Wonderland, Alice N. | 555-1122 | 1969-03-05 | 3.00 2 | Rubin, William | 555-2211 | 1972-07-10 | 15.00 5 | Kemp, Hans | | | (5 rows)
When you ROLLBACK to a SAVEPOINT, changes made since the SAVEPOINT are discarded, but not changes made before the SAVEPOINT. So, you see that the customers table retains the first row that you inserted, but not the second row. When you ROLLBACK to a SAVEPOINT, you are still in the middle of a transaction—you must complete the transaction with a COMMIT or ROLLBACK command.
Here are a few important points to keep in mind when you’re working with SAVEPOINTs:
You can nest SAVEPOINTs. For example, if you create a SAVEPOINT named P1, then create a second SAVEPOINT named P2, you have created a nested SAVEPOINT (P2 is nested within P1). If you ROLLBACK TO SAVEPOINT P2, PostgreSQL discards any changes made since P2, but preserves changes made between P1 and P2. On the other hand, if you ROLLBACK TO SAVEPOINT P1, PostgreSQL discards all changes made since P1, including all changes made since P2. Nested SAVEPOINTs are handy when you are working with a multilevel table structure such as ORDERS and LINEITEMS (where you have multiple line items per order). If you define a SAVEPOINT prior to modifying each order, and a second, nested SAVEPOINT prior to modifying each line item, you can ROLLBACK changes made to a single line item, changes made to a single order, or an entire transaction.
You can use the same SAVEPOINT name as often as you like within a single transaction—the new SAVEPOINT simply replaces the old SAVEPOINT14. Again, this is useful when you are working with a multilevel table structure. If you create a SAVEPOINT prior to processing each line item and you give each of those SAVEPOINTs the same name, you can ROLLBACK changes made to the most recently processed line item.
If you ROLLBACK to a SAVEPOINT, the SAVEPOINT is not destroyed—you can make more changes in the transaction and ROLLBACK to the SAVEPOINT again. However, any SAVEPOINTs nested within that SAVEPOINT will be destroyed. To continue the ORDERS and LINEITEMS example, if you ROLLBACK the changes made to an ORDERS row, you also discard changes made to the LINEITEMS for that ORDER and you are destroying the SAVEPOINT that you created for the most recent line item.
If you make a mistake (such as a typing error), PostgreSQL rolls back to the most recent SAVEPOINT. That’s a very nice feature. If you’ve used PostgreSQL for any length of time, you’ve surely exercised your vocabulary after watching PostgreSQL throw out a long and complex transaction because you made a simple typing error. If you insert SAVEPOINTs in your transaction, you won’t lose as much work when your fingers fumble a table name.
Multi-Versioning and Locking
Most commercial (and open-source) databases use locking to coordinate multiuser updates. If you are modifying a table, that table is locked against updates and queries made by other users. Some databases perform page-level or row-level locking to reduce contention, but the principle is the same—other users must wait to read the data you have modified until you have committed your changes.
PostgreSQL uses a different model called multi-versioning, or MVCC for short (locks are still used, but much less frequently than you might expect). In a multi-versioning system, the database creates a new copy of the rows you have modified. Other users see the original values until you commit your changes—they don’t have to wait until you finish. If you roll back a transaction, other users are not affected—they did not have access to your changes in the first place. If you commit your changes, the original rows are marked as obsolete and other transactions running at the READ COMMITTED isolation level will see your changes. Transactions running at the SERIALIZABLE isolation level will continue to see the original rows. Obsolete data is not automatically removed from a PostgreSQL database. It is hidden, but not removed. You can remove obsolete rows using the VACUUM command. The syntax of the VACUUM command is
VACUUM [ VERBOSE ] [ ANALYZE ] [ table ]
I’ll talk about the VACUUM command in more detail in the next chapter.
The MVCC transaction model provides for much higher concurrency than most other models. Even though PostgreSQL uses multiple versions to isolate transactions, it is still necessary to lock data in some circumstances.
Try this experiment. Open two psql sessions, each connected to the movies database. In one session, enter the following commands:
movies=# BEGIN WORK; BEGIN movies=# INSERT INTO customers VALUES movies-# ( 5, ‘Manyjars, John’, ‘555-8000’, ‘1960-04-02’, 0 ); INSERT
In the other session, enter these commands:
movies=# BEGIN WORK; BEGIN movies=# INSERT INTO customers VALUES movies-# ( 6, ‘Smallberries, John’, ‘555-8001’, ‘1960-04-02’, 0 ); INSERT
When you press the Enter (or Return) key, this INSERT statement completes immediately. Now, enter this command into the second session:
movies=# INSERT INTO customers VALUES movies-# ( 5, ‘Gomez, John’, ‘555-8000’, ‘1960-04-02’, 0 );
This time, when you press Enter, psql hangs. What is it waiting for? Notice that in the first session, you already added a customer whose customer_id is 5, but you have not yet committed this change. In the second session, you are also trying to insert a customer whose customer_id is 5. You can’t have two customers with the same customer_id (because you have defined the customer_id column to be the unique PRIMARY KEY). If you commit the first transaction, the second session would receive a duplicate value error. If you roll back the first transaction, the second insertion will continue (because there is no longer a constraint violation). PostgreSQL won’t know which result to give you until the transaction completes in the first session.