- Dissecting an XML Document Type Definition
- Using Document Type Definition Notation and Syntax
- Understanding Literals
- Declaring a NOTATION
- Creating ATTLIST Declarations
- Using Special XML Datatype Constructions
- Understanding the Difference Between Well-Formed and Valid XML
- Learning How to Use External DTDs and DTD Fragments
- Altering an XML DTD
- Getting Down to Cases
Learning How to Use External DTDs and DTD Fragments
One of the strengths of XML is that you can use or reuse the document types defined for one document for as many others as you like. DTDs can reside in a central repository and can even be combined to make larger DTDs by choosing modular sections.
The following subsections describe how DTDs can be accessed in XML documents or in the DTDs that define them.
Pointing to an External DTD
Non-local external DTDs can be pointed to using the DOCTYPE declaration like this if the DTD is on the Web:
<!DOCTYPE article PUBLIC "-//LeeAnne.com//Article DTD//EN" "http://www.leeanne.com/XML/article.dtd">
Or, a series of DTD fragments can be pointed to using parameter entity references like this:
<!DOCTYPE article PUBLIC "-//LeeAnne.com//Article DTD//EN" "http://www.leeanne.com/XML/article.dtd" [
  <!ENTITY % header PUBLIC "-//LeeAnne.com//Header DTD//EN" "http://www.leeanne.com/XML/header.dtd">
  %header;
  <!ENTITY % footer PUBLIC "-//LeeAnne.com//Footer DTD//EN" "http://www.leeanne.com/XML/footer.dtd">
  %footer;
  ...
]>
This mechanism is widely used to call in files of character entity references but can also be used for other purposes. Be aware, however, that a non-validating processor must not process any entity or attribute-list declaration that follows a reference to an external parameter entity it doesn't read.
The reason is that the state of every parsed entity is undefined after any external reference not incorporated into the document. In fact, it's legal for a non-validating processor to ignore external entities entirely.
Pointing to a Local DTD
Although a local DTD is also an external DTD, the syntax used to reference it is slightly different because one doesn't ordinarily include a catalog reference (the public identifier). Local DTDs can be pointed to using the DOCTYPE declaration like this if the DTD is on your local hard drive:
<!DOCTYPE article SYSTEM "article.dtd">
A series of DTD fragments can be pointed to using parameter entity references like this:
<!DOCTYPE article SYSTEM "article.dtd" [
  <!ENTITY % header SYSTEM "header.dtd">
  %header;
  <!ENTITY % footer SYSTEM "footer.dtd">
  %footer;
  ...
]>
As with Web-based DTDs, this mechanism is widely used to call in files of character entity references but can also be used for other purposes. Be aware, however, that a non-validating processor must not process any entity or attribute-list declaration that follows a reference to an external parameter entity (or an external DTD) it doesn't read.
The reason for this is that the state of every parsed entity is undefined after any external reference not incorporated into the document because a value might or might not have been set for it in the unread external entity. A non-validating parser has no way of knowing one way or another. In fact, it's legal for a non-validating processor to ignore external entities entirely.
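The practical consequence is that declaration order matters in the internal subset. The following is a minimal sketch, not taken from any real DTD: the entity names publisher and edition and the title element are hypothetical, and header.dtd stands in for any external fragment.

```xml
<?xml version="1.0"?>
<!DOCTYPE article SYSTEM "article.dtd" [
  <!-- Declared BEFORE any parameter entity reference:
       every processor, validating or not, sees this declaration. -->
  <!ENTITY publisher "LeeAnne.com">

  <!ENTITY % header SYSTEM "header.dtd">
  %header;

  <!-- Declared AFTER the reference: a non-validating processor that
       did not read header.dtd must not process this declaration,
       because header.dtd might already have declared "edition". -->
  <!ENTITY edition "Second">
]>
<article>
  <title>&publisher; style guide</title>
</article>
```

If a document has to work with non-validating processors, it is safest to declare the general entities it depends on ahead of the first external parameter entity reference, as publisher is here.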
TIP
Why do we bother with non-validating processors anyway? Wouldn't it be better to validate everything? In a word, no. Imagine how difficult it would be to load a page if browsers required every link to be traversed and verified before they would highlight a hyperlink. XML browsers can transclude (that is, include as actual content) documents from anywhere on the Web, so you have no control over how those pages are structured. If a page you transclude includes extensive and recursive access to other pages or DTDs, you may have a long wait before the page loads. XML browsers are likely to mark the place where an external reference is made and let the user choose whether to load it, much as the user has control over following a hyperlink.
It's likely that most actual user agents (browsers) used on the Web are non-validating. The potential overhead of validation, visiting every referenced location, is so great that the most sensible plan for any browser is to wait to visit external documents until requested to do so by the user.
Using DTD Fragments
The preceding two examples both used DTD fragments to extend the article DTD. If you think of a document as a tree, then a DTD fragment is a way to graft another limb onto the tree. The DTD must be structured so that this can be done, and until the XML namespace initiative comes to fruition you have to pay careful attention to namespaces yourself so that names in the fragments don't collide.
The most common use of DTD fragments is to reference the long lists of general entities used to refer to character entities by mnemonic name. So you can use the more understandable &lsquo; instead of &#8216; when referring to a left single quotation mark. Few people have memorized the table of ISO and Unicode characters by number. It's just too hard for most people. The easiest way to use the mnemonics is to call in a predefined set of them by name. There are a lot of these, of course, but the ordinary ones you'll see are Latin-1, Special Symbols, and Mathematics and Greek Symbols.
Here's how to point to Latin-1 for XHTML:
<!ENTITY % ISOlat1 PUBLIC "-//W3C//ENTITIES Latin1//EN//XML" "http://www.w3.org/TR/xhtml1/DTD/HTMLlat1x.ent"> %ISOlat1;
The one we just looked at is defined as
<!ENTITY lsquo "&#8216;"> <!-- left single quotation mark, U+2018 ISOnum -->
inside the special characters set and invoked like this:
<!ENTITY % HTMLspecial PUBLIC "-//W3C//ENTITIES Special//EN//HTML" "http://www.w3.org/TR/xhtml1/DTD/HTMLspecialx.ent"> %HTMLspecial;
If you're going to do any math at all you might need this one as well:
<!ENTITY % HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols//EN//HTML" "http://www.w3.org/TR/xhtml1/DTD/HTMLsymbolx.ent"> %HTMLsymbol;
which contains things like
<!ENTITY cong "&#8773;"> <!-- approximately equal to, U+2245 ISOtech -->
<!ENTITY asymp "&#8776;"> <!-- almost equal to = asymptotic to, U+2248 ISOamsr -->
<!ENTITY ne "&#8800;"> <!-- not equal to, U+2260 ISOtech -->
<!ENTITY equiv "&#8801;"> <!-- identical to, U+2261 ISOtech -->
<!ENTITY le "&#8804;"> <!-- less-than or equal to, U+2264 ISOtech -->
<!ENTITY ge "&#8805;"> <!-- greater-than or equal to, U+2265 ISOtech -->
<!ENTITY sub "&#8834;"> <!-- subset of, U+2282 ISOtech -->
<!ENTITY sup "&#8835;"> <!-- superset of, U+2283 ISOtech -->
These symbols are commonly used in mathematics and scientific fields. This symbol set also contains Greek letters and other goodies to make it easier to type simple mathematics into your pages.
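Once a set has been called in, its mnemonics can be used anywhere general entity references are allowed. The following is a small sketch of a document that pulls in the symbol set through its internal subset; the para element is hypothetical and is assumed to be declared in article.dtd.

```xml
<?xml version="1.0"?>
<!DOCTYPE article SYSTEM "article.dtd" [
  <!-- Declare the entity set, then reference it so that its
       declarations are read into the DTD. -->
  <!ENTITY % HTMLsymbol PUBLIC "-//W3C//ENTITIES Symbols//EN//HTML"
    "http://www.w3.org/TR/xhtml1/DTD/HTMLsymbolx.ent">
  %HTMLsymbol;
]>
<article>
  <!-- &le; and &ne; are declared in the symbol set called in above -->
  <para>Assume that x &le; y and that y &ne; 0.</para>
</article>
```

Keep in mind that a non-validating processor is free to skip the external set, in which case it has no definitions for the mnemonics the document uses.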
A good way to allow easy use of modular DTDs is to provide stubs that can be used to expand on DTD capabilities when needed and ignored when not.
As an example, a DTD might start out with some entities that point to null files like this:
<!ENTITY % header SYSTEM "nullfile.dtd"> %header; <!ENTITY % footer SYSTEM "nullfile.dtd"> %footer; ...
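When you want to actually use the header and footer information, you can override these values with ones that actually do something. Declarations in the internal subset are read before the external DTD, and the first declaration of an entity is the one that counts, so the override looks like this:
<!DOCTYPE article SYSTEM "article.dtd" [
  <!ENTITY % header SYSTEM "header.dtd">
  <!ENTITY % footer SYSTEM "footer.dtd">
  ...
]>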
In the next chapter, you'll explore another way to insert new tags, or entire branches
Parameter Entities
A parameter entity is a way to store data, or a pointer to data, for later use. Parameter entities are valid only within a DTD and behave differently in the internal subset of the DTD than in the external subset. This asymmetric behavior was decided upon to make it easier for a non-validating XML processor to parse an internal DTD.
NOTE
After spending so much time clarifying the different ways in which you can use parameter entities to make life easy, it's somewhat ironic that most of the XML schema proposals pretty much scrap parameter entities in favor of an XML-like, as opposed to DTD-like, syntax. And so it goes....
You can put any number of things into a parameter entity and use them at your convenience. A typical use is to read in external files within a DTD, either in the external or internal subset. Another common use is to store self-contained pieces of markup so that they can be used in a mnemonic way as shorthand for complex expressions. For bits of markup that are used over and over again, the gain in clarity can be enormous.
You may want to refer to Chapter 4, "Extending a Document Type Definition with Local Modifications," for a complete example of working with and around parameter entities. The following expands an element defined using parameter entities because a copy of it was made in the internal DTD subset:
<!ELEMENT blink %Flow;> <!ATTLIST blink %attrs; >
The previous code, with two parameter entities in two locations, expanded through a series of indirections to this:
<!ELEMENT blink (#PCDATA | p | h1 | h2 | h3 | h4 | h5 | h6 | div | ul | ol | dl
  | menu | dir | pre | hr | blockquote | address | center | noframes | isindex
  | fieldset | table | form | a | br | span | bdo | object | applet | img | map
  | iframe | tt | i | b | big | small | u | s | strike | font | basefont | em
  | strong | dfn | code | q | sub | sup | samp | kbd | var | cite | abbr
  | acronym | input | select | textarea | label | button | ins | del | script
  | noscript)* >
<!ATTLIST blink
  id ID #IMPLIED
  class CDATA #IMPLIED
  style CDATA #IMPLIED
  title CDATA #IMPLIED
  lang NMTOKEN #IMPLIED
  xml:lang NMTOKEN #IMPLIED
  dir (ltr|rtl) #IMPLIED
  onclick CDATA #IMPLIED
  ondblclick CDATA #IMPLIED
  onmousedown CDATA #IMPLIED
  onmouseup CDATA #IMPLIED
  onmouseover CDATA #IMPLIED
  onmousemove CDATA #IMPLIED
  onmouseout CDATA #IMPLIED
  onkeypress CDATA #IMPLIED
  onkeydown CDATA #IMPLIED
  onkeyup CDATA #IMPLIED >
This expansion is obviously somewhat harder to read than the parameter entity equivalent, especially if the construct is used in many locations.
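The same shorthand can be seen at a much smaller scale. The following is a minimal sketch of an external DTD fragment, assuming hypothetical entity names (Inline, coreattrs) and a hypothetical para element; it is not taken from any published DTD.

```xml
<!-- Lives in an external DTD file: references inside declarations
     are only allowed in the external subset. -->

<!-- A reusable content model and a reusable attribute list -->
<!ENTITY % Inline "#PCDATA | em | strong">
<!ENTITY % coreattrs
  "id     ID     #IMPLIED
   class  CDATA  #IMPLIED
   title  CDATA  #IMPLIED">

<!-- Each element reuses the two definitions by name -->
<!ELEMENT para   (%Inline;)*>
<!ATTLIST para   %coreattrs;>
<!ELEMENT em     (%Inline;)*>
<!ATTLIST em     %coreattrs;>
<!ELEMENT strong (%Inline;)*>
<!ATTLIST strong %coreattrs;>
```

Adding a new inline element or a new common attribute then means editing one entity declaration rather than every element that uses it.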
Understanding the W3C Entity Table
Table 3.2 explains where everything can go in a nutshell. Although it really isn't all that clear, a similar table is used in the W3C XML 1.0 Recommendation to attempt to clarify the rules that entities obey in various contexts and is reproduced or paraphrased almost everywhere:
Table 3.2 XML 1.0 Entity Type Behaviors
| Entity Type | Parameter | Internal General | External Parsed General | Unparsed | Character |
| --- | --- | --- | --- | --- | --- |
| Reference in Content | Not recognized | Included | Included if validating | Forbidden | Included |
| Reference in Attribute Value | Not recognized | Included in literal | Forbidden | Forbidden | Included |
| Occurs as Attribute Value | Not recognized | Forbidden | Forbidden | Notify | Not recognized |
| Reference in Entity Value | Included in literal | Bypassed | Bypassed | Forbidden | Included |
| Reference in DTD | Included as PE | Forbidden | Forbidden | Forbidden | Forbidden |
Let's try to clarify this blob of information. The first row of the table refers to locations in the document itself. The second and third rows refer to locations in either the document or its DTD. The final two rows refer to locations in the DTD only. The table starts to make sense only when you realize that the information content depicted is sparse. There are quite a number of entries that tell you only that a given entity type can only appear in one location. I question whether the table is all that valuable, although people will duplicate or paraphrase it. In the actual XML 1.0 Recommendation, almost all the real information about the table lies in the text that references it.
Let's list the few bits of information the table contains in seven simplified rules:
Parameter entities can be used only in the DTD. Outside the DTD, their calling sequence, %name;, is treated as plain text.
Inside a declaration, parameter entities are expanded but only in the external subset.
Outside of a declaration, parameter entities can only be used to insert complete markup declarations, and every insertion is surrounded by a leading and a trailing space.
Internal general entities can only be used to insert text and are bypassed in the DTD. Bypassed means that they are recognized and their name is entered into a lookup table with no current value. When they are declared, the lookup entry has a value associated with it. It's an obscure way of saying you can use them before they are declared.
External parsed general entities are treated just like internal general entities except that you can't refer to them in an attribute value. This is on account of the difficulty of handling character encodings in the context of an attribute.
Unparsed entities can only be used by name, as the value of an attribute declared with type ENTITY or ENTITIES. Nothing else is recognized in that context, including character entities. The only responsibility of the XML processor is to notify the helper application declared in a notation declaration that the unparsed entity needs processing.
Character entities are treated like general entities except that they are included immediately in all cases. Although they are universal, in most cases they should be used as general entities from one of the predefined sets. The raw numeric references, although legal, are almost impossible to read and understand. So, the mnemonic &ne; is easier to identify than the decidedly hard-to-remember &#8800; equivalent, although both refer to the "not equal to" (≠) symbol.
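Several of these rules can be seen side by side in one small document. This is only a sketch under stated assumptions: the element names, the note attribute, and the entities pub, credit, and logo are all hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE article [
  <!ELEMENT article (para, figure)>
  <!ELEMENT para (#PCDATA)>
  <!ATTLIST para note CDATA #IMPLIED>
  <!ELEMENT figure EMPTY>
  <!ATTLIST figure src ENTITY #REQUIRED>

  <!-- Internal general entities. The &pub; reference inside the
       value of "credit" is bypassed here and expanded only when
       &credit; is actually used in the document. -->
  <!ENTITY pub "LeeAnne.com">
  <!ENTITY credit "Published by &pub;">

  <!-- An unparsed entity and the notation that identifies its type -->
  <!NOTATION gif SYSTEM "image/gif">
  <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
]>
<article>
  <!-- General entity in an attribute value and in content; a
       character reference; and %pub;, which is plain text here
       because parameter entities are not recognized outside the DTD. -->
  <para note="&credit;">x &#8800; y. &credit;. Here %pub; is just text.</para>

  <!-- The unparsed entity appears by name only, never as &logo; -->
  <figure src="logo"/>
</article>
```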
Parameter entities are so powerful, in fact, that the XML 1.0 standard takes care to curb their power by restricting them first to the DTD and then further restricting the most complex uses to the external subset of the DTD, where non-validating XML processors never (well, rarely) look.
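The difference between the two subsets can be sketched as follows. The entity and element names are hypothetical, and header.dtd stands in for any external fragment.

```xml
<?xml version="1.0"?>
<!DOCTYPE article SYSTEM "article.dtd" [
  <!-- Allowed in the internal subset: a reference placed between
       declarations, standing in for complete declarations. -->
  <!ENTITY % header SYSTEM "header.dtd">
  %header;

  <!-- Declaring a parameter entity that holds part of a declaration
       is fine here... -->
  <!ENTITY % Inline "#PCDATA | em | strong">

  <!-- ...but referencing it INSIDE a declaration is not allowed in
       the internal subset. The declaration below is legal only in
       an external file such as article.dtd or header.dtd:

       <!ELEMENT para (%Inline;)*>
  -->
]>
<article>...</article>
```

A processor that reads only the internal subset therefore never has to splice replacement text into the middle of a declaration.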
TIP
If you're using parameter entities to store text, be aware that XML processors insert one leading and one trailing space when expanding them. This is to discourage people from using them to store small pieces of text and building words (or especially markup!) out of them.
The need for parameter entities probably will eventually disappear when one or more of the XML schema proposals gains market share in the minds of users. But this will be a while as the dueling standards have a lot to reconcile yet.