Creating Element Content Models
To declare the syntax of an element in a DTD, we use an <!ELEMENT> declaration like this: <!ELEMENT name content_model>. In this syntax, name is the name of the element we're declaring and content_model is the content model of the element. A content model indicates what content the element is allowed to have. For example, you can allow child elements or text data, or you can make the element empty by using the EMPTY keyword, or you can allow any content by using the ANY keyword, as you'll soon see. Here's how to declare the <document> element in ch04_01.xml:
<!DOCTYPE document [
<!ELEMENT document (employee)*>
. . .
]>
This <!ELEMENT> declaration not only declares the <document> element, but it also says that the <document> element may contain <employee> elements. When you declare an element in this way, you also specify what contents that element can legally contain; the syntax for doing that is a little involved. The following sections dissect that syntax, taking a look at how to specify the content model of elements, starting with the least restrictive content model of all, ANY, which allows any content at all.
Handling Any Content
If you give an element the content model ANY, that element can contain any content, which means any declared elements and/or any character data, in any order. In effect, you're turning off content checking for this element: a validator doesn't check which children appear or how they're arranged, although any child elements that do appear must still be declared somewhere in the DTD. Here's how to specify the content model ANY for an element named <document>:
<!DOCTYPE document [
<!ELEMENT document ANY>
. . .
]>
As far as the XML validator is concerned, this just turns off content checking for the <document> element. It's usually not a good idea to do that, but it can be useful for specific elements, for example, when you're debugging a DTD that's not working. It's usually far preferable to actually list the contents you want to allow in an element, such as any possible child elements the element can contain.
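As a minimal sketch of what ANY permits, the document below mixes free text with a child element. The <note> element is a hypothetical name chosen for illustration; it is declared in the DTD because a child that appears inside an ANY element still needs its own declaration.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document ANY>
<!ELEMENT note (#PCDATA)>
]>
<document>
    Free text is allowed directly inside the element.
    <note>So is any declared element, in any order.</note>
    More text can follow.
</document>
```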
Specifying Child Elements
You can specify what child elements an element can contain in that element's content model. For example, you can specify that an element can contain another element by explicitly listing the name of the contained element in parentheses, like this:
<!DOCTYPE document [
<!ELEMENT document (employee)*>
. . .
]>
This specifies that a <document> element can contain <employee> elements. The * here means that a <document> element can contain any number (including zero) of <employee> elements. (We'll talk about what other possibilities besides * are available in a few pages.) With this line in a DTD, you can now start placing an <employee> element or elements inside a <document> element, this way:
<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
]>
<document>
    <employee>
    . . .
    </employee>
</document>
Note, however, that this is no longer a valid XML document because you haven't specified the syntax for individual <employee> elements. Because <employee> elements can contain <name>, <hiredate>, and <projects> elements, in that order, you can specify a content model for <employee> elements this way:
<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee (name, hiredate, projects)>
<!ELEMENT name (lastname, firstname)>
]>
<document>
    <employee>
        <name>
            <lastname>Kelly</lastname>
            <firstname>Grace</firstname>
        </name>
        <hiredate>October 15, 2005</hiredate>
        <projects>
            <project>
                <product>Printer</product>
                <id>111</id>
                <price>$111.00</price>
            </project>
            <project>
                <product>Laptop</product>
                <id>222</id>
                <price>$989.00</price>
            </project>
        </projects>
    </employee>
</document>
Listing multiple elements in a content model this way is called creating a sequence. You use commas to separate the elements you want to have appear, and then the elements have to appear in that sequence in our XML document. For example, if you declare this sequence in the DTD:
<!ELEMENT employee (name, hiredate, projects)>
then inside an <employee> element, the <name> element must come first, followed by the <hiredate> element, followed by the <projects> element, like this:
<employee>
    <name>
        <lastname>Kelly</lastname>
        <firstname>Grace</firstname>
    </name>
    <hiredate>October 15, 2005</hiredate>
    <projects>
        <project>
            <product>Printer</product>
            <id>111</id>
            <price>$111.00</price>
        </project>
        <project>
            <product>Laptop</product>
            <id>222</id>
            <price>$989.00</price>
        </project>
    </projects>
</employee>
This example introduces a whole new set of elements (<lastname>, <firstname>, <hiredate>, and so on) that don't contain other elements at all; they contain text. So how can you specify that an element contains text? Read on.
Handling Text Content
In the preceding section's example, elements such as <lastname>, <firstname>, and <hiredate> contain text data. In DTDs, non-markup text is called parsed character data: text that the XML processor parses, looking for markup and replacing entity references such as &amp;, so that what remains is plain character data. In a DTD, we refer to parsed character data as #PCDATA. Note that this is the only way to refer to text data in a DTD; you can't say anything about the actual format of the text, although that might be important if you're dealing with numbers. In fact, this lack of precision is one of the reasons that XML schemas were introduced.
Here's how to give the text-containing elements in the example the #PCDATA content model:
<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee (name, hiredate, projects)>
<!ELEMENT name (lastname, firstname)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT hiredate (#PCDATA)>
<!ELEMENT projects (project)*>
<!ELEMENT project (product,id,price)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT id (#PCDATA)>
<!ELEMENT price (#PCDATA)>
]>
<document>
    <employee>
        <name>
            <lastname>Kelly</lastname>
            <firstname>Grace</firstname>
        </name>
        <hiredate>October 15, 2005</hiredate>
        <projects>
            <project>
                <product>Printer</product>
                <id>111</id>
                <price>$111.00</price>
            </project>
            <project>
                <product>Laptop</product>
                <id>222</id>
                <price>$989.00</price>
            </project>
        </projects>
    </employee>
</document>
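To see the sequence rule at work once every element is declared, consider this cut-down sketch. The two-child <employee> model is a simplification assumed for illustration, not one of the chapter's listings. The document is well formed, but a validating parser should accept the first <employee> and report the second, because <hiredate> appears before <name>.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee (name, hiredate)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT hiredate (#PCDATA)>
]>
<document>
    <!-- Valid: children follow the declared order -->
    <employee>
        <name>Grace Kelly</name>
        <hiredate>October 15, 2005</hiredate>
    </employee>
    <!-- Not valid: <hiredate> comes before <name> -->
    <employee>
        <hiredate>October 20, 2005</hiredate>
        <name>Cary Grant</name>
    </employee>
</document>
```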
NOTE
Can you mix elements and PCDATA in the same content model? Yes, you can. This is called a mixed content model, and you'll see how to work with such models in a few pages.
You're almost done with the sample DTD, except for the * symbol. The following section takes a look at * and the other possible symbols to use.
Specifying Multiple Child Elements
There are a number of options for declaring an element that can contain child elements. You can declare the element to contain a single child element:
<!ELEMENT document (employee)>
You can declare the element to contain a list of child elements, in order:
<!ELEMENT document (employee, contractor, partner)>
You can also use symbols with special meanings in DTDs, such as *, which means "zero or more of," as in this example, where you're allowing zero or more <employee> elements in a <document> element:
<!ELEMENT document (employee)*>
There are a number of other ways of specifying multiple children by using symbols. (This syntax resembles the regular expression syntax used in languages such as Perl, so if you know regular expressions, you have a leg up here.) Here are the possibilities:
x+ means x can appear one or more times.
x* means x can appear zero or more times.
x? means x can appear once or not at all.
x, y means x followed by y.
x | y means x or y, but not both.
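The five forms above can be sketched as a handful of declarations. Every element name here is hypothetical and chosen only to make each operator easy to read; the fragment is meant as the contents of a DTD, not as part of the chapter's employee example.

```xml
<!-- One or more <member> children -->
<!ELEMENT team (member+)>
<!-- Zero or more <member> children -->
<!ELEMENT roster (member*)>
<!-- An optional <nickname> child -->
<!ELEMENT profile (nickname?)>
<!-- <firstname> followed by <lastname>, in that order -->
<!ELEMENT fullname (firstname, lastname)>
<!-- Either a <phone> or an <email>, but not both -->
<!ELEMENT contact (phone | email)>

<!-- The leaf elements hold text only -->
<!ELEMENT member (#PCDATA)>
<!ELEMENT nickname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
```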
The following sections take a look at these options.
Allowing One or More Children
You might want to specify that a <document> element can contain between 200 and 250 <employee> elements, and if you do, you're out of luck with DTDs because DTD syntax doesn't give us that kind of precision. On the other hand, you still do have some control here; for example, you can specify that a <document> element must contain one or more <employee> elements if you use a + symbol, like this:
<!ELEMENT document (employee)+>
Here, the XML processor is being told that a <document> element has to contain at least one <employee> element.
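A minimal sketch of the + rule, assuming for brevity that <employee> holds only text:

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)+>
<!ELEMENT employee (#PCDATA)>
]>
<!-- Valid: at least one <employee> is present. An empty
     <document></document> would fail validation here. -->
<document>
    <employee>Grace Kelly</employee>
</document>
```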
Allowing Zero or More Children
In a DTD, you can use the * symbol to specify that you want an element to contain any number of child elements, that is, zero or more child elements. You saw this in action earlier today, when you specified that the <document> element may contain <employee> elements in the ch04_01.xml example:
<!ELEMENT document (employee)*>
Allowing Zero or One Child
When using a DTD, you can use ? to specify zero or one child elements. Using ? indicates that a particular child element may be present once in the element you're declaring, but it need not be. For example, here's how to indicate that a <document> element may contain zero or one <employee> elements:
<!ELEMENT document (employee)?>
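As a small sketch of the ? rule, again assuming a text-only <employee> for brevity, an empty <document> passes validation:

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)?>
<!ELEMENT employee (#PCDATA)>
]>
<!-- Valid: ? allows the child to be absent. Exactly one <employee>
     would also be valid; two or more would not. -->
<document></document>
```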
Using +, *, and ? in Sequences
You can use the +, *, and ? symbols in content model sequences. For example, here's how you might specify that there can be one or more <name> elements for an employee, an optional <hiredate> element, and any number of <projects> elements:
<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee (name+, hiredate?, projects*)>
<!ELEMENT name (lastname, firstname)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT hiredate (#PCDATA)>
<!ELEMENT projects (project)*>
<!ELEMENT project (product,id,price)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT id (#PCDATA)>
<!ELEMENT price (#PCDATA)>
]>
<document>
    <employee>
        <name>
            <lastname>Kelly</lastname>
            <firstname>Grace</firstname>
        </name>
        <hiredate>October 15, 2005</hiredate>
        <projects>
            <project>
                <product>Printer</product>
                <id>111</id>
                <price>$111.00</price>
            </project>
            <project>
                <product>Laptop</product>
                <id>222</id>
                <price>$989.00</price>
            </project>
        </projects>
    </employee>
</document>
Using +, *, and ? inside sequences provides a lot of flexibility because it means you can specify how many times an element can appear in a sequence, and even whether the element can be absent altogether.
In fact, you can get even more powerful results by using the +, *, and ? operators inside sequences. By using parentheses, we can create subsequences, that is, sequences inside sequences. For example, say that we wanted to allow each employee to list multiple names (including nicknames and so on), possibly list his or her age, and give multiple phone numbers. You can do that by using the subsequence shown in Listing 4.2.
Listing 4.2 A Sample XML Document That Uses Subsequences in a DTD (ch04_02.xml)
<?xml version = "1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee ((name, age?, phone*)+, hiredate, projects)>
<!ELEMENT name (lastname, firstname)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT hiredate (#PCDATA)>
<!ELEMENT projects (project)*>
<!ELEMENT project (product,id,price)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT id (#PCDATA)>
<!ELEMENT price (#PCDATA)>
]>
<document>
    <employee>
        <name>
            <lastname>Kelly</lastname>
            <firstname>Grace</firstname>
        </name>
        <phone> 555.2345 </phone>
        <hiredate>October 15, 2005</hiredate>
        <projects>
            <project>
                <product>Printer</product>
                <id>111</id>
                <price>$111.00</price>
            </project>
            <project>
                <product>Laptop</product>
                <id>222</id>
                <price>$989.00</price>
            </project>
        </projects>
    </employee>
    <employee>
        <name>
            <lastname>Grant</lastname>
            <firstname>Cary</firstname>
        </name>
        <age> 32 </age>
        <phone> 555.2346 </phone>
        <hiredate>October 20, 2005</hiredate>
        <projects>
            <project>
                <product>Desktop</product>
                <id>333</id>
                <price>$2995.00</price>
            </project>
            <project>
                <product>Scanner</product>
                <id>444</id>
                <price>$200.00</price>
            </project>
        </projects>
    </employee>
    <employee>
        <name>
            <lastname>Gable</lastname>
            <firstname>Clark</firstname>
        </name>
        <age> 46 </age>
        <phone> 555.2347 </phone>
        <hiredate>October 25, 2005</hiredate>
        <projects>
            <project>
                <product>Keyboard</product>
                <id>555</id>
                <price>$129.00</price>
            </project>
            <project>
                <product>Mouse</product>
                <id>666</id>
                <price>$25.00</price>
            </project>
        </projects>
    </employee>
</document>
Getting creative when defining subsequences and using the +, *, and ? operators allows us to be extremely flexible in DTDs.
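To show what the + on a subsequence buys you, here is a pared-down sketch in which one employee passes through the (name, age?, phone*) group twice. The flattened text-only <name> element, the nickname, and the second phone number are assumptions made up for illustration.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee ((name, age?, phone*)+, hiredate)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT hiredate (#PCDATA)>
]>
<document>
    <employee>
        <!-- First pass through the group: a name, no age, two phones -->
        <name>Grace Kelly</name>
        <phone>555.2345</phone>
        <phone>555.9876</phone>
        <!-- Second pass: a nickname with an age and no phone -->
        <name>Gracie</name>
        <age>32</age>
        <!-- The group is finished; the rest of the sequence follows -->
        <hiredate>October 15, 2005</hiredate>
    </employee>
</document>
```

Each pass must start with a <name>, so an <age> or <phone> can never appear before the first <name>.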
Allowing Choices
DTDs can support choices. By using a choice, we can specify one of a group of items. For example, if you want to specify that one (and only one) of either <x>, <y>, or <z> will appear, use a choice like this:
(x | y | z)
Listing 4.3 shows an example of using choices in the document ch04_03.xml. In that example, each project is allowed to contain either a <price> element or a <discountprice> element. To indicate that that's what you want, you only need to make this change to the DTD (as well as declare the new <discountprice> element):
<!ELEMENT project (product, id, (price | discountprice))>
Listing 4.3 A Sample XML Document That Uses Choices in a DTD (ch04_03.xml)
<?xml version = "1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee (name, hiredate, projects)>
<!ELEMENT name (lastname, firstname)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT hiredate (#PCDATA)>
<!ELEMENT projects (project)*>
<!ELEMENT project (product, id, (price | discountprice))>
<!ELEMENT product (#PCDATA)>
<!ELEMENT id (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT discountprice (#PCDATA)>
]>
<document>
    <employee>
        <name>
            <lastname>Kelly</lastname>
            <firstname>Grace</firstname>
        </name>
        <hiredate>October 15, 2005</hiredate>
        <projects>
            <project>
                <product>Printer</product>
                <id>111</id>
                <discountprice>$111.00</discountprice>
            </project>
            <project>
                <product>Laptop</product>
                <id>222</id>
                <price>$989.00</price>
            </project>
        </projects>
    </employee>
    . . .
    <employee>
        <name>
            <lastname>Gable</lastname>
            <firstname>Clark</firstname>
        </name>
        <hiredate>October 25, 2005</hiredate>
        <projects>
            <project>
                <product>Keyboard</product>
                <id>555</id>
                <price>$129.00</price>
            </project>
            <project>
                <product>Mouse</product>
                <id>666</id>
                <discountprice>$25.00</discountprice>
            </project>
        </projects>
    </employee>
</document>
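You can also use the +, *, and ? operators with choices. For example, to allow multiple prices and discount prices and to insist that at least one element from the choice appear in the XML document, you can apply + to the whole choice, like this:
<!ELEMENT project (product, id, (price | discountprice)+)>
Be careful about putting * on an item inside a repeated choice, as in (price | discountprice*)+. Because discountprice* can match nothing at all, that group no longer requires any element to appear, and the model becomes ambiguous, which some validators report as an error.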
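As a rough sketch of a repeating choice in action, the stripped-down document below applies + to the whole (price | discountprice) group. The simplified DTD and the extra discount amounts are assumptions made for illustration and are not taken from the chapter's listings.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (project)*>
<!ELEMENT project (product, id, (price | discountprice)+)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT id (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT discountprice (#PCDATA)>
]>
<document>
    <!-- One regular price followed by two discount prices -->
    <project>
        <product>Printer</product>
        <id>111</id>
        <price>$111.00</price>
        <discountprice>$99.00</discountprice>
        <discountprice>$89.00</discountprice>
    </project>
    <!-- A single discount price also satisfies the + -->
    <project>
        <product>Mouse</product>
        <id>666</id>
        <discountprice>$25.00</discountprice>
    </project>
</document>
```

A <project> that stops after <id> would not validate, because the + requires at least one element from the choice.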
As you can see, there are plenty of options available when it comes to specifying elements or text content in DTDs (although XML schemas allow us to be even more precise, specifying numeric formats for numbers and so on). But what if we want a content model to let an element contain both elements and text? That's coming up next.
Allowing Mixed Content
When using a DTD, you can allow an element to contain both text and child elements, giving it a mixed content model. Note that without a mixed content model (or the content model ANY), an element can't contain child elements and text data at the same level. For example, if <product> is declared as (#PCDATA), this doesn't work:
<product>
    Keyboard
    <stocknumber>1113</stocknumber>
</product>
However, you can set up a DTD so that an element can contain child elements, text data, or both. To do that, we list #PCDATA first in a DTD choice, followed by the element names we want to allow, and apply * to the whole group. Listing 4.4 shows an example of this; in this example, the <product> element is declared so that it can have text content or it can contain a <stocknumber> element.
Listing 4.4 A Sample XML Document That Uses a Mixed Content Model (ch04_04.xml)
<?xml version = "1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee (name, hiredate, projects)>
<!ELEMENT name (lastname, firstname)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT hiredate (#PCDATA)>
<!ELEMENT projects (project)*>
<!ELEMENT project (product, id, price)>
<!ELEMENT product (#PCDATA | stocknumber)*>
<!ELEMENT id (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT stocknumber (#PCDATA)>
]>
<document>
    <employee>
        <name>
            <lastname>Kelly</lastname>
            <firstname>Grace</firstname>
        </name>
        <hiredate>October 15, 2005</hiredate>
        <projects>
            <project>
                <product>
                    <stocknumber>1111</stocknumber>
                </product>
                <id>111</id>
                <price>$111.00</price>
            </project>
            <project>
                <product>
                    Laptop
                </product>
                <id>222</id>
                <price>$989.00</price>
            </project>
        </projects>
    </employee>
    . . .
    <employee>
        <name>
            <lastname>Gable</lastname>
            <firstname>Clark</firstname>
        </name>
        <hiredate>October 25, 2005</hiredate>
        <projects>
            <project>
                <product>
                    <stocknumber>1113</stocknumber>
                </product>
                <id>555</id>
                <price>$129.00</price>
            </project>
            <project>
                <product>Mouse</product>
                <id>666</id>
                <price>$25.00</price>
            </project>
        </projects>
    </employee>
</document>
There are plenty of restrictions when we use a mixed content model like this in a DTD. #PCDATA must come first in the choice, we cannot specify the order or the number of the child elements, and we cannot apply the +, *, or ? operators to individual items in the group; the only operator allowed is the * that follows the closing parenthesis. In fact, there's usually very little reason to use mixed content models in data-oriented XML like this. We're almost always better off being consistent and declaring a new element that can contain our text data than using a mixed content model.
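Under the XML 1.0 rules for mixed content, a declaration of the form (#PCDATA | a | b)* lets character data and the listed elements appear in any order and in any number. A minimal sketch, using hypothetical <description> and <emphasis> elements that are not part of the chapter's employee example:

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE description [
<!ELEMENT description (#PCDATA | emphasis)*>
<!ELEMENT emphasis (#PCDATA)>
]>
<description>Ships in <emphasis>two</emphasis> days with <emphasis>free</emphasis> delivery.</description>
```

This kind of inline markup inside running text is where mixed content tends to be used in practice; the DTD has no way to require that <emphasis> appear, or to limit how often it does.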
Allowing Empty Elements
Elements don't need to have any content at all, of course; they can be empty. As you would expect, you can support empty elements by using DTDs. In particular, you can create an empty content model with the keyword EMPTY, like this:
<!ELEMENT intern EMPTY>
This declares an empty element named <intern/> that you can use to indicate that an employee is an intern. Listing 4.5 shows this new empty element at work in ch04_05.xml. As you can see, this example allows each <employee> element to contain an <intern/> element and makes that element optional.
Listing 4.5 A Sample XML Document That Uses an Empty Element (ch04_05.xml)
<?xml version = "1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee (intern?, name, hiredate, projects)>
<!ELEMENT name (lastname, firstname)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT hiredate (#PCDATA)>
<!ELEMENT projects (project)*>
<!ELEMENT project (product, id, price)>
<!ELEMENT product (#PCDATA)>
<!ELEMENT id (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT intern EMPTY>
]>
<document>
    <employee>
        <intern/>
        <name>
            <lastname>Kelly</lastname>
            <firstname>Grace</firstname>
        </name>
        <hiredate>October 15, 2005</hiredate>
        <projects>
            <project>
                <product>Printer</product>
                <id>111</id>
                <price>$111.00</price>
            </project>
            <project>
                <product>Laptop</product>
                <id>222</id>
                <price>$989.00</price>
            </project>
        </projects>
    </employee>
    . . .
    <employee>
        <intern/>
        <name>
            <lastname>Gable</lastname>
            <firstname>Clark</firstname>
        </name>
        <hiredate>October 25, 2005</hiredate>
        <projects>
            <project>
                <product>Keyboard</product>
                <id>555</id>
                <price>$129.00</price>
            </project>
            <project>
                <product>Mouse</product>
                <id>666</id>
                <price>$25.00</price>
            </project>
        </projects>
    </employee>
</document>
Empty elements can't contain any content, but they can contain attributes, and tomorrow we'll talk about how to support attributes in DTDs.
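As a short sketch of how strictly EMPTY is enforced, the document below writes the <intern> element three ways. The cut-down <employee> model with a text-only <name> is an assumption made to keep the example small. The document is well formed, but a validating parser should reject the third <employee>.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE document [
<!ELEMENT document (employee)*>
<!ELEMENT employee (intern?, name)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT intern EMPTY>
]>
<document>
    <!-- Valid: empty-element tag -->
    <employee><intern/><name>Grace Kelly</name></employee>
    <!-- Also valid: start tag immediately followed by end tag -->
    <employee><intern></intern><name>Cary Grant</name></employee>
    <!-- Not valid: even whitespace counts as content in an EMPTY element -->
    <employee><intern> </intern><name>Clark Gable</name></employee>
</document>
```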