XML Syntax Rules
In this section, we explain the various syntactical rules of XML. Documents that follow these rules are called well-formed, but not necessarily valid, as we'll see. If your document breaks any of these rules, it will be rejected by most, if not all, XML parsers.
Well-Formedness
The minimal requirement for an XML document is that it be well-formed, meaning that it adheres to a small number of syntax rules, which are summarized in Table 3-1 and explained in the following sections. However, a document can abide by all these rules and still be invalid. To be valid, a document must both be well-formed and adhere to the constraints imposed by a DTD or XML Schema.
TABLE 3-1 XML Syntax Rules (Well-Formedness Constraints)
Legal XML Name Characters
An XML Name (sometimes called simply a Name) is a token that
begins with a letter, underscore, or colon (but not other punctuation)
continues with letters, digits, hyphens, underscores, colons, or full stops [periods], known as name characters.
Names beginning with the string "xml", or any string which would match (('X'|'x')('M'|'m')('L'|'l')), are reserved.
Element and attribute names must be valid XML Names. (Attribute values need not be.) An NMTOKEN (name token) is any mixture of name characters (letters, digits, hyphens, underscores, colons, and periods).
NOTE
The Namespaces in XML Recommendation assigns a meaning to names that contain colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes (e.g., xsl:template).
Listing 3-2 illustrates a number of legal XML Names, followed by three that should be avoided but may or may not be identified as illegal, depending on the XML parser you use, and four that are definitely illegal. (This is file name-tests.xml on the CD; you can try this with your favorite parser, or with one of the ones provided on the CD.)
Listing 3-2 Legal, Illegal, and Questionable XML Names
<?xml version = "1.0" encoding = "UTF-8" standalone = "yes"?>
<Test>
  <!-- legal -->
  <price />
  <Price />
  <pRice />
  <_price />
  <subtotal07 />
  <discounted-price />
  <discounted_price />
  <discounted.price />
  <discountedPrice />
  <DiscountedPrice />
  <DISCOUNTEDprice />
  <kbs:DiscountedPrice />
  <xlink:role />
  <xsl:apply-templates />
  <!-- discouraged -->
  <xml-price />
  <xml:price />
  <discounted:price />
  <!-- illegal -->
  <7price />
  <-price />
  <.price />
  <discounted price />
</Test>
From the legal examples, we see that any mixture of uppercase and lowercase is fine, as are numbers, and the punctuation characters that were in the definition.
Since the last three examples in the first group use a colon, they are assumed to be elements in the namespaces identified by the prefixes "kbs", "xlink", and "xsl". Of these, the last two refer to W3C-specified namespaces; xlink:role is an attribute defined by the XLink specification and xsl:apply-templates is an element defined by the XSLT specification. The "kbs" prefix refers to a hypothetical namespace, which I could have declared (but didn't), since namespaces do not come only from the W3C. (See chapter 5 for a thorough discussion of namespaces.)
The three debatable examples are xml-price, xml:price, and discounted:price. The first two use the reserved letters "xml"; you shouldn't use them, but most parsers won't reject them. The discounted:price example uses a colon, which is frowned upon if "discounted" is not meant to be a prefix associated with a declared namespace.
The four illegal cases are much clearer. The first three, 7price, -price, and .price, are illegal because the initial character is not a letter, underscore, or colon. The fourth example is illegal because a space character cannot occur in an XML Name. Most parsers will treat it as an element named discounted with an attribute named price that is missing its required equals sign and value.
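If you don't have the CD at hand, the following minimal sketch shows one way to try a few of these names yourself. It assumes Python's standard-library xml.etree.ElementTree parser, which is not one of the parsers discussed in this chapter, and the helper name is_well_formed is purely illustrative. The prefixed names from Listing 3-2 are left out because this particular parser is namespace-aware and rejects undeclared prefixes.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A selection of the unprefixed names from Listing 3-2.
legal = ["price", "_price", "subtotal07", "discounted-price",
         "discounted_price", "discounted.price", "DISCOUNTEDprice"]
illegal = ["7price", "-price", ".price", "discounted price"]

for name in legal + illegal:
    # Wrap each candidate name in an empty-element tag and parse it.
    verdict = "accepted" if is_well_formed(f"<{name} />") else "rejected"
    print(f"{name!r}: {verdict}")
```

Other parsers may word their error messages differently, but any conforming parser should reject the four illegal names.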
NOTE
XML Names and NMTOKENS apply to elements, attributes, processing instructions, and many other constructs where an identifier is required, so it's important to understand what is and what is not legal.
Elements and Attributes Are Case-Sensitive
Unlike HTML, which is case-insensitive (as is the SGML metalanguage of which HTML is an application), XML is strictly case-sensitive, and so is every application of XML (e.g., XSLT, MathML, SVG, and so forth, plus any languages you create). Therefore, the following elements are all unique and are in no way related to one another in XML:
price Price PRICE
The case-sensitive nature of XML often confuses novices. Be sure to remember this when doing string comparisons in code.
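To make the point concrete, here is a small sketch, again assuming Python's standard-library xml.etree.ElementTree (our choice for illustration, not something this chapter prescribes). The wrapper element Prices is hypothetical. It shows that the three spellings are distinct element names, and that an end tag differing only in case does not match its start tag.

```python
import xml.etree.ElementTree as ET

# Three children whose names differ only in case.
root = ET.fromstring(
    "<Prices><price>1</price><Price>2</Price><PRICE>3</PRICE></Prices>"
)
tags = [child.tag for child in root]

# Lookups are exact-match, so only the all-lowercase element is found.
lowercase_only = root.findall("price")

# An end tag that differs in case does not close the start tag.
try:
    ET.fromstring("<price>9.99</Price>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False

print(tags, len(lowercase_only), mismatch_accepted)
```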
The W3C's Extensible HyperText Markup Language (XHTML) recasts HTML in XML syntax. In XHTML, all elements and attributes have lowercase names, such as:
body h1 img href
Notice that this is not merely a convention; it is an absolute requirement. An XHTML document that contains capital letters in element or attribute names is simply invalid, even though uppercase or mixed-case names such as BODY, Body, or even bOdY would be perfectly acceptable in HTML.
Uppercase Keywords
Since XML is case-sensitive, it should not be surprising that certain special words must appear in a particular case. In general, the keywords that relate to DTDs (e.g., DOCTYPE, ENTITY, CDATA, ELEMENT, ATTLIST, PCDATA, IMPLIED, REQUIRED, and FIXED) must be all uppercase. On the other hand, the various strings used in the XML declaration (e.g., xml, version, standalone, and encoding) must appear in all lowercase.
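A quick way to convince yourself of this is to change the case of a keyword and see whether a parser still accepts the document. The sketch below assumes Python's standard-library xml.etree.ElementTree; the variable names and the tiny Test documents are our own illustrations.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Correct case: uppercase DTD keywords, lowercase declaration strings.
correct = '<?xml version="1.0"?><!DOCTYPE Test [<!ELEMENT Test EMPTY>]><Test />'

# Each variant changes the case of exactly one keyword.
lowercase_doctype = '<!doctype Test [<!ELEMENT Test EMPTY>]><Test />'
lowercase_element = '<!DOCTYPE Test [<!element Test EMPTY>]><Test />'
uppercase_version = '<?xml VERSION="1.0"?><Test />'

for label, doc in [("correct", correct),
                   ("lowercase doctype", lowercase_doctype),
                   ("lowercase element", lowercase_element),
                   ("uppercase VERSION", uppercase_version)]:
    print(label, "->", "accepted" if is_well_formed(doc) else "rejected")
```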
Case Conventions or Guidelines
When creating your own XML vocabulary, it would be desirable if there were conventions to explain the use of uppercase, lowercase, mixed case, underscores, and hyphens. Unfortunately, no such conventions exist in XML 1.0. It is a good idea to adopt your own conventions and to apply them consistently, at least across your project, but ideally throughout your entire organization.
For example, for element names I prefer using what is often called CamelCase because the initial letter of each word in a multiword name is uppercase and all others are lowercase, creating humps like a camel's back. (It's also sometimes called TitleCase because it resembles the title of a book.) For example:
<DiscountPrice rate="20%" countryCode="US" />
Note that for attributes, I also use CamelCase, except that the first word always begins with a lowercase letter, as in "countryCode". In fact, the terms UpperCamelCase (as I use for elements) and lowerCamelCase (as I use for attributes) are often used to make this distinction clearer. One reason that I favor this convention is that in any context (including documentation), it's easy to distinguish elements from attributes.
It would be just as reasonable, however, to use all uppercase letters for elements, all lowercase for attributes, and a hyphen to separate multipart terms as in the following examples, or even to use all uppercase for elements and attributes.
<DISCOUNT-PRICE rate="20%" country-code="US" />
As stated earlier, for XHTML, the W3C elected to use all lowercase letters. The most important thing is to pick a convention for your project (or your company) and to be consistent across developers and applications.
We've seen UpperCamelCase for elements and lowerCamelCase for attributes in the employee example: Employee with its sex attribute, Address, PhoneNumbers, and so on. The following fragment from the W3C's SOAP 1.2 Part 2 Adjuncts Working Draft (http://www.w3.org/TR/2001/WD-soap12-part2-20011002/#N4008D) illustrates its use of UpperCamelCase for element names and lowerCamelCase for attributes, as well as for namespace prefixes.
<env:Body>
  <m:GetLastTradePrice
      env:encodingStyle="http://www.w3.org/2001/09/soap-encoding"
      xmlns:m="http://example.org/2001/06/quotes">
    <m:Symbol>DEF</m:Symbol>
  </m:GetLastTradePrice>
</env:Body>
Root Element Contains All Others
There must be one root element, also known as the document element, which is the parent of all other elements. That is, all elements are nested within the root element. All descendants of the root, whether immediate children or not, represent the content of the root. Recall that the name of the root element is given in the DOCTYPE line if a DTD is referenced (either an external or internal one). We also noted that this document element must be the first element the parser encounters (after the XML prolog, which does not contain elements).
A somewhat surprising aspect, at least to this author, is that the XML Recommendation does not preclude a recursive root! In other words, it is possible for a root element to be defined in a DTD as containing itself. Although this is not common, it is worth noting. For example, in NASA's IML DTD, we allowed that the root element Instrument could contain other Instrument children. (The DTD syntax shown here is formally described in chapter 4.)
<!ELEMENT Instrument (Instrument | Port | CommandProcedureSet)* >
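The single-root rule is easy to verify with a parser. The following sketch assumes Python's standard-library xml.etree.ElementTree and reuses the Instrument and Port names from the IML example purely for illustration; no DTD is consulted, so only well-formedness is being checked.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# One root; it happens to contain another element of the same name.
recursive_root = "<Instrument><Port /><Instrument /></Instrument>"

# Two top-level elements, so there is no single document element.
two_roots = "<Instrument /><Instrument />"

print(is_well_formed(recursive_root))
print(is_well_formed(two_roots))
```

Note that nesting an Instrument inside an Instrument is unremarkable as far as well-formedness goes; whether it is permitted is a question for the DTD.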
Start and End Tags Must Match
Every start tag must have a corresponding end tag to properly delimit the content of the element the tags represent. The start and end tags are indicated exactly as they are in HTML, with < denoting the beginning of a start tag and </ indicating the beginning of the end tag. The end delimiter of each tag is >.
<ElementName>content</ElementName>
Empty Elements
An exception to the rule about start and end tags is the case in which an element has no content. Such empty elements convey information simply by their presence or possibly by their attributes, if any. Examples from XHTML 1.0 include:
<br />
<hr />
<img src="someImage.gif" width="100" height="200" alt="Some Image" />
An empty element begins like a start tag but terminates with the sequence />. Optional white space may be used before the two terminating characters. This author prefers to include a space to emphasize empty elements. The space before /> is necessary for XHTML 1.0 to be handled correctly by older browser versions. Of course, it's also possible to specify an empty element by using regular start and end tags, and this is syntactically identical (from the parser's viewpoint) to the use of empty-element notation.
<img src="someImage.gif" width="100" height="200" alt="Some Image"></img>
Note that just like in HTML (or more appropriately, XHTML), an empty element is often used as a separator, such as <br /> and <hr />, or to indicate by its presence a particular piece of data, or to convey metadata by its attributes. If the term empty element seems strange to you when attributes are involved, just think in terms of the content of the element. There is no content, even when there are attributes, which is why it's called empty.
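The claim that the two notations are identical from the parser's viewpoint can be checked directly. This sketch assumes Python's standard-library xml.etree.ElementTree; the variable names are ours.

```python
import xml.etree.ElementTree as ET

# The same empty element written in both notations.
short_form = ET.fromstring('<img src="someImage.gif" alt="Some Image" />')
long_form = ET.fromstring('<img src="someImage.gif" alt="Some Image"></img>')

# Neither element has text or children; both carry the same attributes.
same_attributes = short_form.attrib == long_form.attrib
both_empty = (short_form.text is None and long_form.text is None
              and len(short_form) == 0 and len(long_form) == 0)

# Serializing the two parsed elements yields the same bytes.
same_serialization = ET.tostring(short_form) == ET.tostring(long_form)

print(same_attributes, both_empty, same_serialization)
```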
Proper Nesting of Start and End Tags
No overlapping of start and end tags from different elements is permitted. Although this might seem like an obvious requirement, HTML as implemented by major browsers is considerably more forgiving and recovers from improper tag overlap. Correct nesting looks like this:
<OuterElement>
  <InnerElement>inner content</InnerElement>
</OuterElement>
An example of improper nesting is:
<OuterElement>
  <InnerElement>inner content</OuterElement>
</InnerElement>
Believe it or not, most browsers recover from this type of error in HTML, but they cannot and will not in XML or any language based on XML syntax. The improper nesting example results in either one or two fatal errors, with a message similar to this (depending on the parser):
Fatal error: end tag '</OuterElement>' does not match start tag. Expected '</InnerElement>'
Fatal error: end tag '</InnerElement>' does not match start tag. Expected '</OuterElement>'
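To see what your own parser reports, you can feed it the improperly nested example. The sketch below assumes Python's standard-library xml.etree.ElementTree, whose message wording differs from the one shown above; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

properly_nested = (
    "<OuterElement><InnerElement>inner content</InnerElement></OuterElement>"
)
overlapping = (
    "<OuterElement><InnerElement>inner content</OuterElement></InnerElement>"
)

# The properly nested version parses without complaint.
ET.fromstring(properly_nested)

# The overlapping version is a fatal error; capture the parser's message.
error_message = None
try:
    ET.fromstring(overlapping)
except ET.ParseError as err:
    error_message = str(err)

print(error_message)
```

This parser stops at the first fatal error, so only one message is reported.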
Parent, Child, Ancestor, Descendant
The notion of the root element and the proper nesting rules leads us to some conclusions and terminology about the hierarchy of elements that are invariant across all XML documents. The terms ancestor and descendant are not used in the XML 1.0 Recommendation, but they certainly are in the DOM, XSLT, XPath, and so on, which is why they are introduced here:
An element is a child of exactly one parent, which is the element that contains it.
A parent may have more than one child.
Immediate children and also children of a child are descendants of the parent.
An element is an ancestor of all its descendants.
The root is the ancestor of all elements.
Every element is a descendant of the root.
Every element has exactly one parent, except the root, which has no parent.
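These relationships can be explored programmatically. The following sketch assumes Python's standard-library xml.etree.ElementTree, which does not store parent links, so we build a child-to-parent map ourselves. The Employee, Address, and PhoneNumbers names come from the employee example; the City element and the helper ancestors are hypothetical additions for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Employee>"
    "<Address><City>Columbia</City></Address>"
    "<PhoneNumbers />"
    "</Employee>"
)

# Map every child to the one element that contains it.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(element):
    """Return the tag names of all ancestors, nearest first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain


city = root.find("Address/City")

# Every element other than the root is a descendant of the root.
descendants = [e.tag for e in root.iter() if e is not root]

print(ancestors(city))
print(descendants)
```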
Attribute Values Must Be Quoted
In HTML (but not in XHTML), we are permitted to be inconsistent in the use of quotation marks to delimit the values of attributes. Generally, single-word values do not require quotes in HTML. For example, both of these are acceptable and equivalent in HTML:
<IMG SRC=someImage.gif>
<IMG SRC="someImage.gif">
In XML (and in XHTML), however, we are not allowed to be so cavalier about quotes. All attribute values must be quoted, even if there are no embedded spaces.
<img src="someImage.gif" />
<img src='someImage.gif' />
<img src="someImage.gif" width="34" height="17"/>
Notice that either single or double quotes may be used to delimit the attribute values. Of course, if the attribute value contains double quotes, then you must use single quotes as the delimiter, and vice versa.
<Book title="Tudor's Guide to Paris" />
<Object width='5.3"' height='7.1"' />
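The quoting rules can be confirmed with a few one-line documents. This sketch assumes Python's standard-library xml.etree.ElementTree; the variable names are ours.

```python
import xml.etree.ElementTree as ET

# Single and double quotes are interchangeable as delimiters.
double_quoted = ET.fromstring('<img src="someImage.gif" />')
single_quoted = ET.fromstring("<img src='someImage.gif' />")

# Use one kind of quote as the delimiter when the value contains the other.
book = ET.fromstring('<Book title="Tudor\'s Guide to Paris" />')
obj = ET.fromstring("<Object width='5.3\"' height='7.1\"' />")

# An unquoted value, which HTML tolerates, is not well-formed XML.
try:
    ET.fromstring("<img src=someImage.gif />")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(book.get("title"), obj.get("width"), unquoted_accepted)
```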
White Space Is Significant
White space consists of one or more space characters, tabs, carriage returns, or line feeds (denoted as #x20, #x9, #xD, and #xA, respectively). In the XML 1.0 Recommendation, white space is symbolized in production rules by a capital "S", with the following definition (see http://www.w3.org/TR/REC-xml#sec-common-syn and http://www.w3.org/TR/REC-xml#sec-white-space):
S ::= (#x20 | #x9 | #xD | #xA)+
In contrast to HTML, in which a sequence of white space characters is collapsed into a single white space and in which newlines are ignored, in XML all white space is taken literally. This means that the following two examples are not equivalent:
<Publication>
  <Published>1992</Published>
  <Publisher>Harmony Books</Publisher>
</Publication>

<Publication>
  <Published>1992</Published>
  <Publisher>Harmony
Books</Publisher>
</Publication>
By default, XML parsers handle the Publisher element differently since in the second example, the string "Harmony Books" contains a newline between the two words. The application that invokes the parser can either consider the white space important, ignore it (i.e., strip it), or inform the parser that it wants white space normalized (collapsed like in HTML).
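Here is a minimal sketch of the difference as an application sees it, assuming Python's standard-library xml.etree.ElementTree. The normalize helper is our own illustration of HTML-style collapsing performed by the application; it is not a feature of the parser.

```python
import xml.etree.ElementTree as ET

one_line = ET.fromstring("<Publisher>Harmony Books</Publisher>")
two_lines = ET.fromstring("<Publisher>Harmony\nBooks</Publisher>")

# The parser hands the newline to the application untouched.
print(repr(one_line.text))
print(repr(two_lines.text))


def normalize(text: str) -> str:
    """Collapse every run of white space into a single space."""
    return " ".join(text.split())


# Indentation between elements is also reported as character data.
indented = ET.fromstring(
    "<Publication>\n  <Published>1992</Published>\n</Publication>"
)
print(repr(indented.text))
```

After normalization the two Publisher values compare equal, which is the behavior many applications want for display text but not, for example, for preformatted content.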
Comments
Comments in XML are just like they are in HTML. They begin with the character sequence <!-- and end with the sequence -->. The parser ignores what appears between them, except to verify that the comment is well-formed.
<Publication>
  <Published>1992</Published>
  <!-- This appears to be the second edition. -->
  <Publisher>Harmony Books</Publisher>
</Publication>
In XML, however, there are several restrictions regarding comments:
Comments cannot contain the double hyphen combination "--" anywhere except as part of the comment's start and end delimiters. Thus, this comment is illegal: <!-- illegal comment --->
Comments cannot be nested. This means you need to take care when commenting out a section that already contains comments.
Comments cannot precede the XML declaration because that part of the prolog must be the very first line in the document.
Comments are not permitted in a start or end tag. They can appear only between tags (as if they were content) or surrounding tags.
Comments may be used to cause the parser to ignore blocks of elements, provided that the result, once the commented-out block is effectively removed by the parser, is still well-formed XML.
Parsers are not required to make comments available to the application, so don't use them to pass data to an application; use Processing Instructions, discussed next.
Comments are also permitted in the DTD, as discussed in chapter 4.
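Several of these restrictions can be checked mechanically. The sketch below assumes Python's standard-library xml.etree.ElementTree, which in its default configuration also happens to illustrate the point about parsers not passing comments along: the parsed tree contains no trace of them. All variable names are illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


legal = ("<Publication><!-- second edition -->"
         "<Publisher>Harmony Books</Publisher></Publication>")

# Each of these breaks one of the restrictions listed above.
double_hyphen = "<Publication><!-- illegal comment ---></Publication>"
nested = "<Publication><!-- outer <!-- inner --> outer --></Publication>"
before_declaration = '<!-- note --><?xml version="1.0"?><Publication />'
inside_tag = "<Publication <!-- no --> />"

# By default this parser drops the comment: only Publisher remains.
children = [child.tag for child in ET.fromstring(legal)]
print(children)
```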
Processing Instructions
Processing instructions (often abbreviated as PI) are directives intended for an application other than the XML parser. Unlike comments, parsers are required to pass processing instructions on to the application. The general syntax for a PI is:
<?targetApplication applicationData ?>
Here, targetApplication is the name (any XML Name) of the application that should receive the instruction, and applicationData is any arbitrary string that doesn't contain the end delimiter. Often applicationData consists of name/value pairs that resemble attributes with values, but there is no requirement concerning the format. Aside from the delimiters "<?" and "?>", which must appear exactly as shown, the only restriction is that there can be no space between the initial question mark and the target. Some examples follow.
<?xml-stylesheet type="text/xsl" href="foo.xsl" ?>
<?MortgageRateHandler rate="7%" period="30 years" ?>
<?javaApp class="MortgageRateHandler" ?>
<?javaApp This is the data for the MortgageRateHandler, folks! ?>
<?acroread file="mortgageRates.pdf" ?>
Processing instructions are not part of the actual structure of the document, so they may appear almost anywhere, except before the XML declaration or in a CDATA section. The parser's responsibility is merely to pass the PI and its data on to the application. Since the same XML document could be processed by multiple applications, it is entirely possible that some applications will ignore a given PI and just pass it down the chain. In that case, the processing instruction will be acted upon only by the application for which it is intended (has meaning).
Although an XML declaration looks like a processing instruction because it is wrapped in the delimiters "<?" and "?>", it is not considered a PI. It is simply an XML declaration, the one-of-a-kind markup that is optional but, when present, must be the very first thing in the document.
The target portion of the processing instruction can be a notation (defined in chapter 4). For example:
<!NOTATION AcrobatReader SYSTEM "/usr/local/bin/acroread">
The corresponding PI would be:
<?AcrobatReader file="Readme.pdf" size="75%" ?>
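To see how a PI reaches an application, the following sketch assumes Python's standard-library xml.dom.minidom, a DOM implementation that keeps processing instructions as nodes. The Rates root element is hypothetical. Notice that the data arrives as one opaque string; splitting it into name/value pairs is left entirely to the application.

```python
from xml.dom import minidom, Node
from xml.parsers.expat import ExpatError

doc = minidom.parseString(
    '<?xml-stylesheet type="text/xsl" href="foo.xsl" ?>'
    '<Rates><?MortgageRateHandler rate="7%" period="30 years" ?></Rates>'
)

# A PI before the root element is a child of the document node.
stylesheet = doc.firstChild

# A PI inside the root element is a child of that element.
handler = doc.documentElement.firstChild

for pi in (stylesheet, handler):
    print(pi.target, "|", pi.data.strip())

# A space between "<?" and the target is not allowed.
try:
    minidom.parseString('<? javaApp class="MortgageRateHandler" ?><Rates />')
    space_accepted = True
except ExpatError:
    space_accepted = False
```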
Entity References
Entity references are markup that the parser replaces with character data. In HTML, there are hundreds of predefined character entities, including the Greek alphabet, math symbols, and the copyright symbol. There are only five predefined entity references in XML, however, as shown in Table 3-2.
TABLE 3-2 Predefined Entity References
Character | Entity Reference | Decimal Representation | Hexadecimal Representation
< | &lt; | &#60; | &#x3C;
> | &gt; | &#62; | &#x3E;
& | &amp; | &#38; | &#x26;
" | &quot; | &#34; | &#x22;
' | &apos; | &#39; | &#x27;
We've already seen how entity references can be used as content. They can also appear within attribute values. According to Table 3-2,
<CD title="Brooks &amp; Dunn&apos;s Greatest Hits" />
is equivalent to the decimal representation:
<CD title="Brooks &#38; Dunn&#39;s Greatest Hits" />
and to the hexadecimal representation:
<CD title="Brooks &#x26; Dunn&#x27;s Greatest Hits" />
However, the next line is illegal because ampersand ("&") must be escaped by using either the entity reference or one of its numeric representations:
<CD title="Brooks & Dunn's Greatest Hits" />
This is because ampersand and less-than are special cases.
NOTE
You are required to use the predefined entities &lt; and &amp; to escape the characters < and & in all cases other than when these characters are used as markup delimiters, or in a comment, a processing instruction, or a CDATA section. In other words, the literal < and & characters can appear only as markup delimiters, or within a comment, a processing instruction, or a CDATA section.
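The equivalence of the three representations, and the illegality of the bare ampersand, can be checked with a short sketch. It assumes Python's standard-library xml.etree.ElementTree for parsing and xml.sax.saxutils.escape for producing escaped text; neither is prescribed by this chapter, and the variable names are ours.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

named = '<CD title="Brooks &amp; Dunn&apos;s Greatest Hits" />'
decimal = '<CD title="Brooks &#38; Dunn&#39;s Greatest Hits" />'
hexadecimal = '<CD title="Brooks &#x26; Dunn&#x27;s Greatest Hits" />'

# All three forms yield the same attribute value after parsing.
titles = {ET.fromstring(doc).get("title")
          for doc in (named, decimal, hexadecimal)}
print(titles)

# A literal ampersand in an attribute value is not well-formed.
try:
    ET.fromstring('<CD title="Brooks & Dunn&apos;s Greatest Hits" />')
    literal_accepted = True
except ET.ParseError:
    literal_accepted = False

# When generating XML, let a library do the escaping of & and <.
escaped = escape("AT&T <StockWatch>")
print(escaped)
```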
Listing 3-3 illustrates the use of all five predefined character entities, several decimal representations of Greek letters, and the three legal variations of the Brooks & Dunn example. If we run this through an XML parser, we can verify that it is well-formed; we did not use the literal ampersand or the literal less-than before the word StockWatch. Figure 3-1 shows how this example looks in Internet Explorer, which renders the characters that are represented by the entities. It also confirms that the three Brooks & Dunn variations are equivalent.
Listing 3-3 Examples of Predefined Entities and Greek Letters (predefined-entities.xml)
<?xml version="1.0" standalone="yes"?>
<Predefined>
  <Test>The hot tip from today&apos;s &lt;StockWatch&gt; column is:
    &quot;AT&amp;T stock is doing better than Ralph Spoilsports Motors&apos; stock.&quot;
  </Test>
  <PS>Now, wasn&apos;t that as easy as &#928;? Or &#945;, &#946;, &#947;?</PS>
  <CD title="Brooks &amp; Dunn&apos;s Greatest Hits" />
  <CD title="Brooks &#38; Dunn&#39;s Greatest Hits" />
  <CD title="Brooks &#x26; Dunn&#x27;s Greatest Hits" />
</Predefined>
FIGURE 3-1 Predefined entities displayed in Internet Explorer
HTML (and therefore XHTML) includes three large sets of predefined entities: Latin1, Special, and Symbols. You can pull these definitions into your XML document using external entities, covered in chapter 4. The files containing the entities are:
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
CDATA Sections
Sometimes it is necessary to indicate that a particular block of text should not be interpreted by the parser. One example is a large number of occurrences of the five predefined entities in a block of text that contains no markup, such as a section of code that needs to test for the numeric less-than or Boolean &&. In this case, we want text that would normally be considered markup to be treated simply as literal character data. CDATA sections are designated portions of an XML document in which all markup is ignored by the parser and all text is treated as character data instead. The main uses of CDATA sections are:
To delimit blocks of source code (JavaScript, Java, etc.) embedded in XML
To embed XML, XHTML, or even HTML examples in an XML document
The general syntax for a CDATA section is:
<![CDATA[
multi-line text block to be treated as character data
]]>
No spaces are permitted within the two delimiters "<![CDATA[" and "]]>".
Here's a CDATA section used to escape a block of code:
<![CDATA[
function doIt() {
  var foo = 3;
  var bar = 13;
  // The literal < and && below are legal only because this is a CDATA section.
  if (foo < 8 && bar > 8)
    alert("Help!");
  else
    alert("I'm Down");
}
]]>
An example of embedded XML in XML follows.
<Example>
  <Number>2.4</Number>
  <XMLCode>
<![CDATA[
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE Message SYSTEM "message.dtd">
<Message mime-type="text/plain">
  <!-- This is a trivial example. -->
  <From>The Kenster</From>
  <To>Silly Little Cowgirl</To>
  <Body>
    Hi, there. How is your gardening going?
  </Body>
</Message>
]]>
  </XMLCode>
</Example>
In contrast to our earlier use of the Message example, the character data is not simply the three lines of content of the From, To, and Body elements. When this example is embedded within a CDATA section, the entire block is character data, which in this case means from the XML declaration to and including the </Message> end tag. In other words, the XML prolog, the comment, the start and end tags, and so on, are no longer markup; in this context, they constitute the character data contained by the CDATA section.
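The following sketch shows the effect from the application's side, assuming Python's standard-library xml.etree.ElementTree. The shortened Message content and the variable names are our own. The markup inside the CDATA section comes back as plain text, and the XMLCode element has no child elements at all.

```python
import xml.etree.ElementTree as ET

document = (
    "<XMLCode><![CDATA[<Message>\n"
    "  <From>The Kenster</From>\n"
    "  if (foo < 8 && bar > 8)\n"
    "</Message>]]></XMLCode>"
)
root = ET.fromstring(document)

# Everything between the delimiters is character data, not markup.
print(root.text)
print(len(root))  # number of child elements

# The same text outside a CDATA section is not well-formed.
try:
    ET.fromstring("<XMLCode>if (foo < 8 && bar > 8)</XMLCode>")
    bare_accepted = True
except ET.ParseError:
    bare_accepted = False

# No spaces are permitted inside the opening delimiter.
try:
    ET.fromstring("<XMLCode><![ CDATA[ text ]]></XMLCode>")
    spaced_accepted = True
except ET.ParseError:
    spaced_accepted = False
```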