Writing Advanced SQL Code
For performing complex operations, PostgreSQL supports a pool of features that can make the life of a database developer much easier. Features such as arrays and datatypes for storing network information are often essential for building sophisticated database applications.
Another important topic is masquerading. In the next section, you take a closer look at the idea behind masquerading, and you will learn in which situations characters have to be masqueraded.
Masquerading
Masquerading characters is extremely important and can save you a lot of time.
The purpose of character masquerading can be explained in a few words. Imagine a situation where you have to insert a string into a field that contains special characters. In SQL, characters such as single quotes (') have to be masqueraded because the database must distinguish between characters that will be inserted in a field and characters that are syntax components.
Assume that we want to add a French title to table prodtext. The title of the book in our example is Qu'est-ce que Python, which includes a single quote in the title. We try the following SQL command:
shop=# INSERT INTO prodtext(prodid, lang, proddesc) VALUES (385343,'french','Qu'est-ce que Python'); shop'#
The database did not perform the query and is still prompting for some input. If you look closer at the value in the column called proddesc, you can see that the database has problems distinguishing between the single quotes used for the SQL command and the single quote in the text that has to be added. There are two possibilities to solve the problem: You can use a backslash or use a second single quote to make the INSERT command work properly; here is the code for the two methods:
shop=# INSERT INTO prodtext(prodid, lang, proddesc) VALUES (385343,'french','Qu''est-ce que Python'); INSERT 51554 1 shop=# INSERT INTO prodtext(prodid, lang, proddesc) VALUES (385343,'french','Qu\'est-ce que Python'); INSERT 51555 1
Both methods work the same way. The following shows how the values are stored in the database:
shop=# SELECT * FROM prodtext WHERE prodid='385343' AND lang='french'; id | prodid | lang | proddesc ----+--------+--------+---------------------- 11 | 385343 | french | Qu'est-ce que Python 12 | 385343 | french | Qu'est-ce que Python (2 rows)
The same result can be reached with the following query. We look for all records that contain a single quote. Because we have inserted only two records, the result will be the same as before:
SELECT * FROM prodtext WHERE lang='french' AND proddesc ~ '\'';
Not only single quotes have to be treated in a special way when dealing with regular expressions. Other characterssuch as $ and ^, for instancehave to be masqueraded. This also can be done with the help of a backslash.
Deciding which characters to masquerade is easy: Do it for all characters that have a special meaning in regular expressions and try it with a backslash.
Working with Arrays
Arrays are data structures that enable the user to store multiple values of a certain datatype in one data structure in one field. Arrays are so called non-atomic values, which means that they consist of components. The size of an array can be limited or unlimited depending on the users' needs.
The next example creates a table that contains a one-dimensional array for storing comments. The array is of undefined size, which means unlimited:
CREATE TABLE comments (id serial, prodid int8, comment text[]);
We will now insert a dataset into the table:
INSERT INTO comments(prodid, comment) VALUES ('385343', '{"language","programming","Python"}');
The array is used to store three values in one column. The syntax of the INSERT command is not too difficult. The values are actually passed to the database in single quotes. The brackets within the two single quotes contain the three values of the arrayall in double quotes.
Let's see how the values are returned when selecting the record from the table:
id | prodid | comment ----+--------+------------------------------------- 1 | 385343 | {"language","programming","Python"} (1 row)
If you want to retrieve values from the array, you use a SELECT statement. The WHERE clause has to be modified slightly by adding the index of the array to the name of the column. You can see in the next example that it is not enough to use the name of the column only:
shop=# SELECT * FROM comments WHERE comment='programming'; ERROR: array_in: Need to specify dimension
Here is the correct version of the query:
shop=# SELECT * FROM comments WHERE comment[1]='programming'; id | prodid | comment ----+--------+--------- (0 rows) shop=# SELECT * FROM comments WHERE comment[2]='programming'; id | prodid | comment ----+--------+------------------------------------- 1 | 385343 | {"language","programming","Python"} (1 row)
The examples show that a record can be retrieved from an array only when the correct index of the array is specified in the query. This can sometimes be very tricky; because every array might have a different size, the user might not know about the required records.
Arrays also can be defined with a maximum size, by adding the size between the parentheses in the CREATE TABLE statement:
CREATE TABLE comments (id serial, prodid int8, comment text[5]);
The example creates a table with an array with a maximum of 5 entries.
Arrays are not limited to one dimension. In many cases, you can use multidimensional data structures. Here is an example where we create a table with a two-dimensional data structure:
CREATE TABLE comments (id serial, comments text[][]);
The table is used to store comments. We use the array to store the category and the text of the comment. Now let's insert some data into the table:
INSERT INTO comments(comments) VALUES ('{{"language"}, {"Python"}}');
The table contains one record now. The category is set to language; the text of the category is Python. We can easily retrieve the data from the table:
shop=# SELECT * FROM comments WHERE comments[1][1] ~ '.'; id | comments ----+--------------------------- 2 | {{"language"},{"Python"}} (1 row) shop=# SELECT * FROM comments WHERE comments[1][2] ~ '.'; id | comments ----+---------- (0 rows)
In this example, the columns of the array have to be indexed correctly to receive a result from the database. A two-dimensional array can be compared with a system of coordinates. Every point in the system can be assigned to a value. Multidimensional arrays can be useful for scientific databases. Assume a query where you want to know all employees in a certain area, their income, their profession, and their gender. The output of that query can be a three-dimensional array. The number of employees is assigned to every combination of income, profession, and gender. The required result (for example, income = 2000$, gender = female, profession = lawyer) can be found by indexing the array containing all results. The algorithm just described is used in a real-world situation because this is (possibly) the only way of keeping the load of the database low even when generating a lot of data.
Arrays and the COPY Command
Inserting data cannot always be done with INSERT commands, because this would be too slow when treating huge amounts of data. The next example shows how simple it is to use the COPY command to insert data into an array. The database uses nearly the same syntax for the array as the INSERT command:
COPY "comments" FROM stdin; 2 {{"language"},{"Python"}} \.
Working with BLOBs
Many advanced databases, including PostgreSQL, support treatment of a binary large object (BLOB). Storing a BLOB means storing a file. It does not matter what's in that file; it can be binary data, ASCII, EBCDIC, and so forth. BLOBs have to be treated differently than "ordinary" data, because it would be complicated to use a plain 500MB file within an INSERT command like ordinary text. BLOBs have to be loaded into the database.
Note
BLOBs don't store the position of the file in the file system. BLOBs are stored directly in the database.
The file upload can be done with the lo_import() function. If you want to export the file, the lo_export() function can be used to export the data in the database to a file in the file system.
Let's create a table where we want to store the product id and the image of a product directly within the database:
CREATE TABLE prodimages (id serial, prodid int8, image OID);
Recall that OID is the acronym for object identifier. In Chapter 7, "Backup and Recovery," you will see what that means when you want to back up your complete database.
For now, let's insert a picture in the database:
shop=# INSERT INTO prodimages(prodid, image) VALUES ('385343', lo_import('/tmp/385343.jpg')); INSERT 51759 1
The database generates a unique object id for the file and loads it into the database where it is stored. The unique id of our file in the example is 51745:
shop=# SELECT * FROM prodimages; id | prodid | image ----+--------+------- 1 | 385343 | 51745 (1 row)
You can see in the example that a SELECT statement returns only the object id of the file we have just uploaded. Internally, PostgreSQL creates separate files for every file uploaded:
bash-2.04$ ls -l *51745* -rw------- 1 postgres postgres 8192 Apr 7 15:18 xinv51745 -rw------- 1 postgres postgres 16384 Apr 7 15:18 xinx51745
Now we want to export the file stored in the record to a file. This can be done with the help of the lo_export() function. The syntax is similar to the syntax of the lo_import() function:
SELECT lo_export(prodimages.image,'/tmp/export.jpg') FROM prodimages WHERE prodid='385343';
The database exports the image to the file /tmp/export.jpg. If the files in the database become very big, this might take some timeso watch out when writing Web applications. If a user wants to upload large files into the database, this can lead to trouble. Make sure that the maximum size of the file is limited to a certain size.
Now we want to delete and reinsert the records. You must keep one issue in mind when deleting BLOBs. Never perform DELETE commands on tables containing BLOBs before you have deleted the actual image in the database. If you delete the records, the files in the database containing the image won't be deletedthe database deletes only the object idbut the data will become something like a zombie. So make sure that you use the lo_unlink() function to drop the image first:
SELECT lo_unlink(prodimages.image) FROM prodimages;
The file has now been deleted from the database; let's see whether the files are still there:
bash-2.04$ ls -l *51745* 2> /dev/null | wc -l 0
The files has been found zero times; everything is okay so far, but what about the record in the table?
shop=# SELECT * FROM prodimages; id | prodid | image ----+--------+------- 1 | 385343 | 51745 (1 row)
You can see that the record is still in the table. You can perform a DELETE operation now to safely eliminate the entry.
If you are working on high-availability systems, make sure that no one can access a record whose file has already been deleted. Implement some kind of locking mechanism, such as a locking flag for the datasets.
Using Network-Specific Datatypes
PostgreSQL has special datatypes for storing IP and MAC addresses. These datatypes offer input checking and some special functions and operators. IPs and MAC addresses can also be stored in ASCII text fields, but we strongly recommend using PostgreSQL's functions for network datatypes. Table 3.3 lists PostgreSQL network datatypes (for version 7.0.3).
Table 3.3 PostgreSQL Network Datatypes
Name |
Storage |
Range |
cidr |
12 bytes |
Stores valid IPv4 network addresses |
inet |
12 bytes |
Stores IPv4 hosts or network addresses |
macaddr |
6 bytes |
Stores MAC addresses (Ethernet card hardware addresses) |
The next example creates a table to store logging information in our shop database and shows you how to treat network addresses in PostgreSQL:
CREATE TABLE shoplog(id serial, fieldone cidr, fieldtwo inet, mac macaddr);
The table contains one field of each datatype. Let's see how we can insert valid data into it:
INSERT INTO shoplog(fieldone, fieldtwo, mac) VALUES ('192.168.1.0/24', '192.168.1.1/25', '00:50:FC:1E:7C:26');
The most important difference between cidr and inet is that cidr does not support data with nonzero bits to the right of the netmask. As you can see in the previous example, the netmask has to be written as one number instead of something like 255.255.255.0; otherwise, the system will report a parse error:
shop=# INSERT INTO shoplog(fieldone, fieldtwo, mac) VALUES ('192.168.1.0/255.255.255.0', '192.168.1.1/25', '00:50:FC:1E:7C:26'); ERROR: could not parse "192.168.1.0/255.255.255.0"
If we now select all data from the table, we get this result:
shop=# SELECT * FROM shoplog; id | fieldone | fieldtwo | mac ----+--------------+----------------+------------------- 2 | 192.168.1/24 | 192.168.1.1/25 | 00:50:fc:1e:7c:26 (1 row)
Every column has a valid IP address. In fieldone, the zeros on the right edge of the IP address are omitted because this is redundant information (the database has the netmask of the IP). If we insert an IP address without explicitly adding the netmask, the database takes a default netmask of 32 for the cidr datatype; the inet datatype won't have a netmask:
INSERT INTO shoplog(fieldone, fieldtwo, mac) VALUES ('192.168.1.0', '192.168.1.1/25', '00:50:FC:1E:7C:26'); INSERT INTO shoplog(fieldone, fieldtwo, mac) VALUES ('192.168.1.1', '192.168.1.1/25', '00:50:FC:1E:7C:26'); INSERT INTO shoplog(fieldone, fieldtwo, mac) VALUES ('0.0.0.0','0.0.0.0','00:50:FC:1E:7C:26'); shop=# SELECT * FROM shoplog; id | fieldone | fieldtwo | mac ----+----------------+----------------+------------------- 2 | 192.168.1/24 | 192.168.1.1/25 | 00:50:fc:1e:7c:26 3 | 192.168.1.0/32 | 192.168.1.1/25 | 00:50:fc:1e:7c:26 4 | 192.168.1.1/32 | 192.168.1.1/25 | 00:50:fc:1e:7c:26 5 | 0.0.0.0/32 | 0.0.0.0 | 00:50:fc:1e:7c:26 (4 rows)
In the last three lines of the result, you see what happens when performing the three INSERT statements.
PostgreSQL supports a number of input formats for the MAC address. Here is a complete list of the most important ways (for PostgreSQL 7.0.3):
00-50-FC-1E-7C-26
0050FC:1E7C26
0050FC-1E7C26
0050.FC1E.7C26
Working with Geo-Data
PostgreSQL offers datatypes for geometric operations. PostgreSQL can therefore be used to build powerful solutions to fit scientific demands that can hardly be done with commercial mainstream databases. This section gives a brief insight into the wonderful world of modeling geographic data with PostgreSQL.
Geometric objects can be stored with the help of the datatypes described in the following sections.
point
A point is, in a way, the fundamental data structure when dealing with geometric objects point and can be handled easily and efficiently by the user. A point consists of two values: The first value is the x-axis coordinate, the second value is the y-axis coordinate. Both values are internally stored as floating-point numbers.
The syntax of a point is (x, y).
A point needs 16 bytes of storage:
CREATE TABLE temppoint(fieldname point); INSERT INTO temppoint(fieldname) VALUES ('1,2'); INSERT INTO temppoint(fieldname) VALUES ('(1,3)');
We have created a table called temppoint with one column, and we have inserted two records into the table. One very important detail when dealing with a point is that a it has to be written within single quotes. Otherwise, a syntax error will be displayed:
shop=# INSERT INTO temppoint(fieldname) VALUES (1,2); ERROR: Attribute 'fieldname' is of type 'point' but expression is of type 'int4' You will need to rewrite or cast the expression
Let's see how the two points are returned by a query:
shop=# SELECT * FROM temppoint; fieldname ----------- (1,2) (1,3) (2 rows)
line
A line is a pair of two points. One point matches the beginning of a line; the other point matches the end of the line. A line can be defined as two points: (x1, y1), (x2, y2) or to make the whole thing a little clearer ( (x1, y1), (x2, y2) ).
A line can also be specified by using x1, y1, x2, y2, but we strongly recommend using one of the versions with parentheses to make the code clearer and easier to understand.
A line needs 32 bytes of storage (twice the storage of a point). lseg (line seqment) is an alternative datatype to line.
Let's create a table using line:
CREATE TABLE templine(fieldname line);
The table can already be created, but PostgreSQL 7.0.3 doesn't support INSERT statements yet:
shop=# INSERT INTO templine(fieldname) VALUES ('(1,3), (4,12)'); ERROR: line not yet implemented At the time this book is written, full support of line is not available. You can use path instead of line.
box
A box is used to store a rectangle. A rectangle is defined by two points that are situated opposite each other on the rectangle. The syntax is therefore the same as the syntax of lines. When entering a box into the database, the lower-left corner is determined from the input and stored in the database first. You need not define the lower-left of the rectangle first; this is done by the database internally, and the user does not need to care about the order of the input (not implemented in PostgreSQL 7.0.3 yet).
A box needs 32 bytes of storage.
Here is some code:
CREATE TABLE tempbox(fieldname box); INSERT INTO tempbox(fieldname) VALUES ('(8,9), (1,3)'); INSERT INTO tempbox(fieldname) VALUES ('(1,3), (4,12)'); shop=# SELECT * FROM tempbox; fieldname -------------- (8,9),(1,3) (4,12),(1,3) (2 rows)
path
A path is a sequence of points and can either be open or closed. Closed means that the last point of the path "returns" to the beginning. The length of a path is dynamic. PostgreSQL offers special functions to make sure that a path is either open or closed. popen() and pclose() can be used to force a path to be open or closed. In queries, isopen() and isclosed() can be used to check whether the path is open or closed.
A path needs 4 bytes plus 32 bytes for every node to be stored.
Here are some examples of how you can create a table and insert some values. You can see that a path does not have a fixed number of points.
CREATE TABLE temppath(fieldname path); INSERT INTO temppath(fieldname) VALUES ('(1,3), (4,12)'); INSERT INTO temppath(fieldname) VALUES ('(3,1), (2,8), (10,4)'); shop=# SELECT fieldname FROM temppath; fieldname ---------------------- ((1,3),(4,12)) ((3,1),(2,8),(10,4)) (2 rows)
polygon
A polygon is actually a closed path, but is stored differently and therefore has its own function.
A polygon needs 4 bytes plus 32 bytes for every node to be stored.
Here is some sample code to create a table and to insert some data:
CREATE TABLE temppoly(fieldname polygon); INSERT INTO temppoly(fieldname) VALUES ('(1,3), (4,12), (2,4)'); INSERT INTO temppoly(fieldname) VALUES ('(2,0), (-5,12), (2,0)');
shop=# SELECT * FROM temppoly; fieldname ----------------------- ((1,3),(4,12),(2,4)) ((2,0),(-5,12),(2,0)) (2 rows)
circle
A circle consists of a point and the radius, and needs 24 bytes of storage.
The following creates a table and inserts two values:
shop=# CREATE TABLE tempcircle(fieldname circle); CREATE shop=# INSERT INTO tempcircle(fieldname) VALUES ('10, 4, 10'); INSERT 51916 1 shop=# INSERT INTO tempcircle(fieldname) VALUES ('10, 4, -10'); ERROR: Bad circle external representation '10, 4, -10' shop=# INSERT INTO tempcircle(fieldname) VALUES ('10.3 , 2, 4'); INSERT 51917 1
You can see in the second INSERT statement that the radius of a circle must not be negative and that the database makes sure that no trash is inserted into the table.
Retrieving Geo-Data and Operators
Retrieving data from fields that contain geographic data is a little more complex than retrieving "ordinary" data.
Remember that the ~= operator has to be used instead of the = operator. In the following example, we select a specific point from a table:
SELECT fieldname FROM temppoint WHERE fieldname ~= '(1,2)';
PostgreSQL's geographic datatypes offer much more than just retrieving. The next piece of code calculates the distance of two points:
SELECT '(1,1)'::point <-> '(2,2)'::point; The result is ?column? ----------------- 1.4142135623731 (1 row)
which is exactly the square root of 2 (for further information see "Pythagoras' Theorem" at http://www.geocities.com/researchtriangle/system/8956/problems/pyth.htm or http://www.cut-the-knot.com/pythagoras/index.html). In the previous example, the <-> operator is used to calculate the distance.
You also can add a point by using the + operator, as you do for simple addition:
shop=# SELECT '(1,1)'::point + '2,2'::point; ?column? ---------- (3,3) (1 row)
The operators -, * and / are also supported for performing an operation with a point:
shop=# SELECT '(1,1)'::point - '2,2'::point; ?column? ---------- (-1,-1) (1 row) shop=# SELECT '(1,1)'::point * '2,2'::point; ?column? ---------- (0,4) (1 row) shop=# SELECT '(1,1)'::point / '2,2'::point; ?column? ---------- (0.5,0) (1 row)
The ?- operator is used to see whether two points are horizontal; the ?-| operator checks whether two objects are perpendicular:
shop=# SELECT '(0,0)'::point ?- '-1,0'::point; ?column? ---------- t (1 row)
PostgreSQL offers four operators to check the position of a point relative to a position. The << operator checks whether a point is located left of another point, <^ checks whether the point is below another point, >> checks whether the point is on the right side, and >^ checks whether the point is above another point. In the next example, you can see how such a check can be performed:
shop=# SELECT '(0,0)'::point << '-1,0'::point; ?column? ---------- f (1 row)
PostgreSQL is also able to calculate intersections. The following example calculates the intersection of two line segments:
shop=# SELECT '((0,0),(10,0))'::lseg # '((0,0),(0,10))'::lseg; ?column? ---------- (0,0) (1 row)
PostgreSQL does not return only true or false; it also returns the right value. This is done when evaluating the point of closest proximity:
shop=# SELECT '(99,99)'::point ## '((3,2),(12,4))'::lseg; ?column? ---------- (12,4) (1 row)
In the example, (12, 4) is nearer to (99, 99) than any other point on the line segment.
You can see in these examples that PostgreSQL offers powerful operators you can use to build complex and sophisticated applications. In addition to the operators we have already discussed in this section, the following operators are also available in PostgreSQL:
&& (A overlaps B or not)
&< (overlaps to left)
&> (overlaps to right)
?# (intersects or overlaps)
@-@ (length or circumference)
?| (is vertical)
?|| (is parallel)
@ (contained or on)
@@ (center of)
If you want to get a complete list of all operators in PostgreSQLor if you simply want to know which datatypes the operator you need is defined forcheck out the online help of PostgreSQL by typing \do in your psql-shell.
Combining Queries
One problem encountered when writing SQL statements is that queries might have to be combined because the result of one SQL query might not lead to the final result you want to achieve. Combining queries means that more than one SELECT statement is used in a query. The result of the SELECT statements is put together by using one of the following keywords: UNION, INTERSECT, or EXCEPT.
UNION is used to add the results of SELECT statements, INTERSECT returns only the data the SELECT statements have in common, and EXCEPT returns all records that are not included in the second SELECT statement. For some real-world scenarios, let's go back to the shop database.
Think of a query where we want to retrieve all persons from table employees that are not in table sales. We try the following query:
SELECT * FROM customer EXCEPT SELECT * FROM sales;
Let's see what the data structure of those two tables looks like (use \d to display the data structures):
shop=# \d customer Table "customer" Attribute | Type | Modifier -----------+-------------+--------------------------------------------------- id | integer | not null default nextval('customer_id_seq'::text) name | varchar(50) | not null email | varchar(50) | state | varchar(50) | not null Indices: customer_email_key, customer_id_key, customer_pkey shop=# \d sales Table "sales" Attribute | Type | Modifier -----------+--------------+------------------------------------------------ id | integer | not null default nextval('sales_id_seq'::text) salestime | timestamp | prodid | bigint | prodprice | numeric(9,2) | amount | integer | name | varchar(50) | email | varchar(50) | state | varchar(50) | Indices: sales_email_key, sales_pkey
You can see that those two tables have just a few columns in common. Try to imagine how the database can perform the EXCEPT operation. You will run into trouble when you come to a column that is contained by only one of the two tables
ERROR: Each UNION | EXCEPT | INTERSECT query must have the same number of columns.
An operation must have the same number of columns. If we perform the query like this:
SELECT name FROM customer EXCEPT SELECT name FROM sales;
we receive a correct result:
name ------- Peter (1 row)
Peter is the only person registered as customer who has never bought products. We use the following to retrieve all persons that are in table sales and table customer:
shop=# SELECT name FROM customer INTERSECT SELECT name FROM sales; name -------- John John Robert Robert (4 rows)
Note
This can also be done with a join.
In the result, the database returns four records. If we want to get only one entry for every name, we can use DISTINCT:
SELECT DISTINCT name FROM customer INTERSECT SELECT DISTINCT name FROM sales;
In the example, we have to use DISTINCT in both SELECT statements; otherwise, an error is displayed:
shop=# SELECT DISTINCT name FROM customer INTERSECT SELECT name FROM sales; ERROR: get_sortgroupclause_tle: ORDER/GROUP BY expression not found in targetlist
Now we create a temporary table and insert values into it:
CREATE TEMPORARY TABLE person(name varchar(50), email varchar(50)); INSERT INTO person(name, email) VALUES ('Theodore Ts\'o', 'no@none.any');
If we want to write a query that returns a list containing all names in table sales and table person, we can use UNION:
shop=# SELECT name FROM person UNION SELECT name FROM sales; name --------------- John Robert Theodore Ts'o (3 rows)
Not only can you combine the two SELECT statements. Here is an example where we combine the tables:
shop=# SELECT name FROM person UNION (SELECT name FROM customer EXCEPT SELECT name FROM sales);
The result contains two records:
name --------------- Peter Theodore Ts'o (2 rows)