Document Type Definitions
Eventually, we'll want to do an analysis of the different entities that we're relating together, but in beginning to build our DTD, we'll start off with an excerpt from our products.xml file.
Make a copy and save it in C:\xerces-1_2_3\data (or the appropriate directory on your system). Remove all but the first vendor, Conners Chair Company, as shown in Listing 3.1.
Listing 3.1 products.xml: The Data
0: <?xml version="1.0"?>
1: <products>
2: <vendor webvendor="full">
3: <vendor_name>Conners Chair Company</vendor_name>
4: <advertisement>
5: <ad_sentence>
6: Conners Chair Company presents their annual big three
7: day only chair sale. We're making way for our new
8: stock! <b>All current inventory must go!</b> Regular prices
9: slashed by up to 60%!
10: </ad_sentence>
11: </advertisement>
12:
13: <product>
14: <product_id>QA3452</product_id>
15: <short_desc>Queen Anne Chair</short_desc>
16: <price pricetype="cost">$85</price>
17: <price pricetype="sale">$125</price>
18: <price pricetype="retail">$195</price>
19: <inventory color="royal blue" location="warehouse">
20: 12</inventory>
21: <inventory color="royal blue" location="showroom">
22: 5</inventory>
23: <inventory color="flower print" location="warehouse">
24: 16</inventory>
25: <inventory color="flower print" location="showroom">
26: 3</inventory>
27: <inventory color="seafoam green" location="warehouse">
28: 20</inventory>
29: <inventory color="teal" location="warehouse">
30: 14</inventory>
31: <inventory color="burgundy" location="warehouse">
32: 34</inventory>
33: <giveaway>
34: <giveaway_item>
35: Matching Ottoman included
36: </giveaway_item>
37: <giveaway_desc>
38: while supplies last
39: </giveaway_desc>
40: </giveaway>
41: </product>
42:
43: <product>
44: <product_id>RC2342</product_id>
45: <short_desc>Early American Rocking Chair</short_desc>
46: <product_desc>
47: with brown and tan plaid upholstery
48: </product_desc>
49: <price pricetype="cost">$75</price>
50: <price pricetype="sale">$62</price>
51: <price pricetype="retail">$120</price>
52: <inventory location="warehouse">40</inventory>
53: <inventory location="showroom">2</inventory>
54: </product>
55:
56: <product>
57: <product_id>BR3452</product_id>
58: <short_desc>Bentwood Rocker</short_desc>
59: <price pricetype="cost">$125</price>
60: <price pricetype="sale">$160</price>
61: <price pricetype="retail">$210</price>
62: <inventory location="showroom">3</inventory>
63: </product>
64:
65: </vendor>
66:
67: </products>
(Notice that we added a little bit of XHTML markup on line 8.)
In the spirit of testing as we go along, let's go ahead and parse the file to make sure that there's nothing wrong with it to start. To do this, open a command prompt window and type
cd c:\xerces-1_2_3
java sax.SAXCount data/products.xml
We should get a result similar to when we tested the Xerces installation, such as
data/products.xml: 330 ms (37 elems, 27 attrs, 0 spaces, 610 chars)
If you get an error saying that tags are missing or elements are not terminated properly, there is a problem with the products.xml file. You might have inadvertently removed or left an extra tag when you removed the extra two vendors.
If you get an error saying that the class or the file cannot be found, check for typing errors.
Tip - To save time and typing mistakes, commands can be placed in a batch file. For instance, we can take the command to parse this file
java sax.SAXCount data/products.xml
and place it in a text file called val.bat, which we place in the c:\xerces-1_2_3 directory. Then, to check the file, we just go to that directory and type the following:
val
The script will handle it from there.
Notice that we left the -v switch off the command. The reason is that we're not ready to validate this file yet; we have no DTD to check against! Every single element would currently be an error because it's not defined.
Internal DTD Subsets
We'll start by embedding the DTD in the XML file, the same way we started with style sheets.
As we mentioned earlier, DTDs use a different syntax than XML itself does. To begin building one, we need to make a space for it in the document, as in Listing 3.2.
Listing 3.2 products.xml: Creating an Internal DTD
0: <?xml version="1.0"?>
1: <!DOCTYPE products [
2:
3: <!-- Definition goes here -->
4:
5: ]>
6:
7: <products>
8: <vendor webvendor="full">
...
The <!DOCTYPE> notation on line 1 is called a Document Type Declaration and lets the processor know that this is the start of a Document Type Definition, or DTD. Line 1 also refers to products, which is the root element of the XML below it, starting on line 7. These two must match because the DTD can describe only specific structures. Notice also that the square brackets that open on line 1 close on line 5. They denote the start and the end of the definition itself. XML-style comments are allowed within the DTD, as you can see on line 3.
From here we can take a pretty literal, straightforward view. We want to define each element in terms of the other elements, attributes, or data it can contain. We'll start with the root element, products. In Listing 3.3, we add the only element that can be contained in products, the vendor.
Listing 3.3 products.xml: Defining the Root Element
0: <?xml version="1.0"?>
1: <!DOCTYPE products [
2:
3: <!ELEMENT products (vendor)+>
4:
5: ]>
6:
7: <products>
8: <vendor webvendor="full">
...
Line 3 tells the parser that we have an element named products and that all it can contain is vendor elements. The + sign tells the parser that we can have one or more vendors. Because this is the root element, we want to make sure that we have some data, so at least one is required. Now we need to define the vendor element, as in Listing 3.4.
Listing 3.4 Defining the Vendor Element
0: <?xml version="1.0"?>
1: <!DOCTYPE products [
2:
3: <!ELEMENT products (vendor)+>
4:
5: <!ELEMENT vendor (vendor_name, advertisement?, product*)>
6:
7: ]>
8:
9: <products>
10: <vendor webvendor="full">
...
Line 5 defines a vendor as an element that may contain a vendor_name, advertisement, and products. Actually, we're saying that it must contain exactly one vendor_name, it may contain one advertisement (using the ?), and it may contain any number of products, including 0 (using the *).
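To see how the occurrence indicators play out, here is a small sketch of vendor fragments checked against the content model on line 5. These are illustrative fragments rather than complete documents: the child content is abbreviated to placeholder text, and the webvendor attribute is left off so that we can concentrate on the content model alone.

```xml
<!-- Matches (vendor_name, advertisement?, product*):
     only the required vendor_name is present -->
<vendor>
  <vendor_name>Conners Chair Company</vendor_name>
</vendor>

<!-- Does not match: advertisement? allows at most one advertisement -->
<vendor>
  <vendor_name>Conners Chair Company</vendor_name>
  <advertisement>first ad</advertisement>
  <advertisement>second ad</advertisement>
</vendor>

<!-- Does not match: the required vendor_name is missing -->
<vendor>
  <product>a product</product>
</vendor>
```

The vendor in our products.xml file, with one vendor_name, one advertisement, and three products, fits the model as well.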
We haven't completely defined the vendor element yet, however. Looking at the XML file, we see that vendor can have an attribute, webvendor. We need to put this into the DTD, as in Listing 3.5.
Listing 3.5 products.xml: Adding Attributes to the Vendor Element
0: <?xml version="1.0"?>
1: <!DOCTYPE products [
2:
3: <!ELEMENT products (vendor)+>
4:
5: <!ELEMENT vendor (vendor_name, advertisement?, product*)>
6: <!ATTLIST vendor webvendor CDATA #REQUIRED>
7:
8: ]>
9:
10: <products>
11: <vendor webvendor="full">
...
Let's pick line 6 apart piece by piece. First, the <!ATTLIST> notation indicates that we're defining an attribute list for an element, as opposed to the element itself. Next, we note which element the attribute list is for: specifically, vendor. Then we list the name of the attribute, webvendor, and the type of data that can be contained in it, followed by the fact that the attribute is required.
So, the definition on line 6 means that the vendor element must have one attribute, which is called webvendor and can contain character data.
That's not really very helpful, though, because it doesn't specify anything about what that text should be. We need to make sure that it's one of our three choices, full, partial, or no. We can do that on line 6 of Listing 3.6.
Listing 3.6 products.xml: Specifying Content for the webvendor Attribute
0: <?xml version="1.0"?>
1: <!DOCTYPE products [
2:
3: <!ELEMENT products (vendor)+>
4:
5: <!ELEMENT vendor (vendor_name, advertisement?, product*)>
6: <!ATTLIST vendor webvendor ( full | partial | no ) #REQUIRED>
7:
8: ]>
9:
10: <products>
11: <vendor webvendor="full">
...
We've seen the | connector before, when we were using XSLT. At that time it worked as a sort of "or" statement, and it still does. Only one of those three values is allowed. The value of webvendor must be full or partial or no.
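As a quick sketch of what the enumeration on line 6 buys us, consider how a validating parser would treat a few vendor start tags. Only the start tags are shown here, and the rejected values are made up for illustration.

```xml
<!-- Accepted: partial is one of the three listed values -->
<vendor webvendor="partial">

<!-- Rejected: maybe is not in ( full | partial | no ) -->
<vendor webvendor="maybe">

<!-- Rejected: the values are case-sensitive, so Full is not full -->
<vendor webvendor="Full">

<!-- Rejected: webvendor is #REQUIRED and has been left off -->
<vendor>
```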
Let's move on to our other elements. Listing 3.7 defines vendor_name, advertisement, and product.
Listing 3.7 products.xml: Specifying Content for the vendor Element
0: <?xml version="1.0"?>
1: <!DOCTYPE products [
2:
3: <!ELEMENT products (vendor)+>
4:
5: <!ELEMENT vendor (vendor_name, advertisement?, product*)>
6: <!ATTLIST vendor webvendor ( full | partial | no ) #REQUIRED>
7:
8: <!ELEMENT vendor_name (#PCDATA)>
9:
10: <!ELEMENT advertisement (ad_sentence)+>
11: <!ELEMENT ad_sentence (#PCDATA)>
12:
13: <!ELEMENT product (product_id, short_desc, product_desc?, price+, inventory+, giveaway?)>
14:
15: <!ELEMENT product_id (#PCDATA)>
16: <!ELEMENT short_desc (#PCDATA)>
17: <!ELEMENT product_desc (#PCDATA)>
18:
19: <!ELEMENT price (#PCDATA)>
20: <!ATTLIST price pricetype (cost | sale | retail) 'retail'>
21:
22: <!ELEMENT inventory (#PCDATA)>
23: <!ATTLIST inventory color CDATA #IMPLIED
24: location (showroom | warehouse) 'warehouse'>
25:
26: <!ELEMENT giveaway (giveaway_item, giveaway_desc)>
27: <!ELEMENT giveaway_item (#PCDATA)>
28: <!ELEMENT giveaway_desc (#PCDATA)>
29:
30: ]>
31:
32: <products>
33: <vendor webvendor="full">
...
Let's take this one line at a time. vendor_name, on line 8, is a simple text element, as are product_id, short_desc, and product_desc. #PCDATA represents "Parsed Character Data." This means that it is normal text that the parser reads through looking for markup and entity references. An element declared as (#PCDATA) can hold text, but no child elements.
advertisement can contain one or more ad_sentences, and giveaway must contain one giveaway_item and one giveaway_desc.
The product element is a little more complicated but not much. It contains only elements: specifically, exactly one product_id and short_desc, one optional product_desc, one or more prices, one or more inventory elements, and then an optional giveaway.
It's important to note that the order matters. Subelements must appear in the order in which they're listed in the element's definition.
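To illustrate the point about ordering, here is a sketch of two product fragments that carry the same data. The values come from our products.xml file; only the order of the first two children differs.

```xml
<!-- Valid: children follow the order given in the product declaration -->
<product>
  <product_id>BR3452</product_id>
  <short_desc>Bentwood Rocker</short_desc>
  <price pricetype="retail">$210</price>
  <inventory location="showroom">3</inventory>
</product>

<!-- Invalid: short_desc appears before product_id -->
<product>
  <short_desc>Bentwood Rocker</short_desc>
  <product_id>BR3452</product_id>
  <price pricetype="retail">$210</price>
  <inventory location="showroom">3</inventory>
</product>
```

Both fragments are well-formed XML, so a non-validating parse accepts either one. Only a validating parse reports the second.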
Now let's take some of the more interesting elements.
Attribute Definitions
On lines 19 and 20, we're defining the price element as having a single attribute, called pricetype, which may take the values of cost, sale, or retail. This is called an enumerated datatype, because we are choosing from a set of values. Although we do need to have this information for every price element, we have not made it required. Instead we've given it a default value. If a value isn't supplied, the default value of 'retail' will be used when the data is processed.
Actually, all attributes need some way to handle default values. This can take one of four forms:
- #REQUIRED: As we saw with webvendor, we can force the XML file to provide a value for the attribute.
- #IMPLIED: If an attribute is #IMPLIED, it's not required. If a value isn't supplied, there is no default value, but it's not an error. For instance, we are not concerned if a color isn't specified.
- A literal default, such as 'retail' or 'false': In this case, we provide a value that will be used if no value is provided for the attribute.
- #FIXED 'literal': If an attribute is set as #FIXED, it must always have the literal value supplied. If it's not supplied, the parser will fill in the value. If it is supplied, it has to match.
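The four forms can be placed side by side in a short sketch. The first three declarations are the ones we've been building. The currency attribute in the last one is hypothetical, since our products.xml file has no such attribute, and is included only to show what #FIXED looks like.

```xml
<!-- #REQUIRED: every vendor must supply a webvendor value -->
<!ATTLIST vendor webvendor (full | partial | no) #REQUIRED>

<!-- #IMPLIED: color may be left off, and no default is filled in -->
<!ATTLIST inventory color CDATA #IMPLIED>

<!-- Literal default: a price with no pricetype is treated as retail -->
<!ATTLIST price pricetype (cost | sale | retail) 'retail'>

<!-- #FIXED (hypothetical attribute): currency is always USD -->
<!ATTLIST price currency CDATA #FIXED 'USD'>
```

Under these declarations, a bare <price>$195</price> would be processed as though it carried pricetype="retail" and currency="USD", whereas <price currency="EUR">$195</price> would be reported as a validity error because the value doesn't match the fixed one.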
Finally, on lines 22 through 24 we have the definition of the inventory element with two attributes, color and location. To make things easier to read, we've taken advantage of the capability to include more than one attribute in a single declaration. The color attribute is a string datatype, as we've indicated by setting it as CDATA, or character data.
At this point, we've defined all of our elements, so we're ready to go ahead and validate the document. To do that, we'll go to a command prompt and, after making sure that we're in the xerces-1_2_3 directory, type
java sax.SAXCount data/products.xml -v
Mixed Content
At this point, if we haven't mistyped any of the elements in the DTD, we should see the results of parsing the document. The parser should return a message that says something like the following:
[Error] products.xml:40:14: Element type "b" must be declared.
[Error] products.xml:42:21: The content of element type "ad_sentence" must match "(#PCDATA)".
data/products.xml: 3790 ms (38 elems, 27 attrs, 137 spaces, 473 chars)
Congratulations, you've validated your first document! But wait, what about those errors? Those errors mean that the parser is doing exactly what it's supposed to do. We specified ad_sentence as containing nothing but #PCDATA, or parsed character data. That means that no markup is allowed. Remember, however, that we went ahead and added a bit of markup to ad_sentence when we saved the file.
So, what can we do if we want to allow, say, some XHTML tags in the vendor's advertisement? We need to specifically tell the DTD that this element can contain both #PCDATA and elements. This is called Mixed Content. In Listing 3.8, we'll tell the DTD that ad_sentence can contain any number of the specified items.
Listing 3.8 Mixed Content
...
8: <!ELEMENT vendor_name (#PCDATA)>
9:
10: <!ELEMENT advertisement (ad_sentence)+>
11: <!ELEMENT ad_sentence (#PCDATA | b | i | p )*>
12: <!ELEMENT b (#PCDATA)>
13: <!ELEMENT i (#PCDATA)>
14: <!ELEMENT p (#PCDATA)>
15:
16: <!ELEMENT product (product_id, short_desc, product_desc?, price+, inventory+, giveaway?)>
...
Let's take a good hard look at what this means on line 11. First, because we're following the parentheses with the *, whatever is inside them can appear any number of times within the element. This means that it's acceptable for ad_sentence to be made up of #PCDATA, then a b element, then more #PCDATA, or any combination of #PCDATA and the elements listed.
Of course, we then have to go ahead and declare those elements, as we've done on lines 12 through 14. Even though you and I know they're just XHTML, the parser doesn't. To the parser, they are elements, just as vendor and product and price are elements.
Make the change to the DTD and revalidate the file. This time there should be no errors.
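A few ad_sentence fragments give a feel for what the mixed content declaration does and doesn't allow. This is a sketch, and the sentences other than the one from our file are made up for illustration.

```xml
<!-- Valid: text and b, i, or p children in any order, any number of times -->
<ad_sentence>Regular prices <i>slashed</i> by up to <b>60%</b>!</ad_sentence>

<!-- Valid: plain text with no child elements at all -->
<ad_sentence>All current inventory must go!</ad_sentence>

<!-- Invalid: b is declared as (#PCDATA), so it cannot contain an i element -->
<ad_sentence><b>All <i>current</i> inventory must go!</b></ad_sentence>

<!-- Invalid: u is not declared anywhere in the DTD -->
<ad_sentence>Three days <u>only</u>!</ad_sentence>
```

If we wanted to allow nesting such as an i inside a b, we would have to give b a mixed content model of its own, along the lines of (#PCDATA | i)*.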
DTD Syntax Review
We've covered a lot of ground here, so let's take a moment and review the specific syntax for building DTDs. An element declaration consists of a name and a content model:
<!ELEMENT element-name (content)>
Content can be an element, a series of elements, a choice of elements, or #PCDATA. We can use the following special characters to add more information:
- +: At least one is required, but the element may repeat.
- ?: The element is not required, and may appear only once.
- *: Not required, but may repeat.
- |: Indicates a choice between elements.
An attribute definition consists of the name of the element, the name of the attribute, the type, and the default:
<!ATTLIST element-name attribute-name TYPE 'default'>
The type is typically CDATA, which is just text; a series of choices (such as (true | false)); or one of ID, IDREF, and so on, which will be discussed later. Finally, we list the default, which can be #IMPLIED, #REQUIRED, #FIXED 'somevalue', or just 'somevalue'.
The First Limitation: Datatypes
One thing you might have noticed is that although we can (and must, in fact) get specific about the types of subelements an element can contain, we don't have a lot of control over the specific types of data after we get down to the text level. Elements can be #PCDATA, attributes are CDATA, and that's it. There's no way to indicate that, say, a price can be only a number, or inventory must be an integer.
This is one of the serious limitations of DTDs and was one of the first indications that a better system was needed. Datatypes will be covered by XML Schema.
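As a concrete illustration of the datatype limitation described above, here is a sketch of two elements with made-up content that is clearly wrong for our purposes. Because price and inventory are both declared as (#PCDATA), a validating parser has no grounds to complain about either one.

```xml
<!-- Both are valid against our DTD, even though neither holds a number -->
<price pricetype="sale">cheap</price>
<inventory location="warehouse">lots</inventory>
```

Catching this kind of mistake is left to the application that processes the data.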