Understanding the Well-Formedness Constraints
The well-formedness constraints in the XML 1.0 specification are sprinkled throughout the document, and some of them are hard to dig out because they're not clearly marked. You'll get a look at the well-formedness constraints here, although note that some of them have to do with DTDs and entity references, and those will appear in Day 4, "Creating Valid XML Documents: Document Type Definitions," and Day 5, "Handling Attributes and Entities in DTDs."
Beginning the Document with an XML Declaration
The first well-formedness structure constraint is to start the document with an XML declaration. Even though some XML processors won't insist on it, W3C says you should always include this declaration first thing:
<?xml version = "1.0" encoding="UTF-8" standalone="yes"?>
<document>
    <employee>
        .
        .
        .
TIP
Although the XML 1.0 specification says that only the version attribute is required here, some software (notably including W3C's own Amaya testbed browser) will consider XML documents as not well-formed if you don't also include the encoding attribute.
Using Only Legal Character References
Another well-formedness constraint is that character references, which are character codes enclosed in &# and ; and which are replaced by the characters those codes stand for, must refer only to characters supported by the XML specification.
This constraint is more or less obvious: it simply means that you have to stick to the established character set for the version of XML you're using. Note that, as you saw yesterday, the characters that are legal in XML 1.0 differ somewhat from what's legal in XML 1.1.
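If you want to see how a parser treats character references, here's a minimal sketch in Python using the standard library's xml.etree.ElementTree module. Python, the helper name is_well_formed, and the two sample strings are assumptions chosen for illustration. They aren't part of this book's example document, and the underlying parser follows XML 1.0 rules.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# &#65; refers to the letter A, which is a legal XML 1.0 character.
legal = "<document>&#65;</document>"
# &#1; refers to a control character that XML 1.0 does not allow.
illegal = "<document>&#1;</document>"

print(is_well_formed(legal))
print(is_well_formed(illegal))
print(ET.fromstring(legal).text)
```

The parser replaces the legal reference with the character it stands for and rejects the document that refers to a character outside the XML 1.0 character set.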
Including at Least One Element
To be a well-formed document, a document must include one or more elements. The first element, of course, is the root element, so to be well-formed, a document must contain at least a root element. In other words, an XML document must contain more than just a prolog. Of course, your documents will usually contain many elements, as in our example document:
<?xml version = "1.0" encoding="UTF-8" standalone="yes"?>
<document>
    <employee>
        <name>
            <lastname>Kelly</lastname>
            <firstname>Grace</firstname>
        </name>
        <hiredate>October 15, 2005</hiredate>
        <projects>
            <project>
                .
                .
                .
            </project>
        </projects>
    </employee>
    .
    .
    .
</document>
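To check the "at least one element" rule for yourself, you might try something like the following Python sketch, which again assumes the standard xml.etree.ElementTree module. The helper name parse_error is hypothetical. It feeds the parser a document that consists of nothing but a prolog, and then the same prolog followed by an empty root element.

```python
import xml.etree.ElementTree as ET

def parse_error(data):
    """Return the parser's error message, or None if data is well-formed (hypothetical helper)."""
    try:
        ET.fromstring(data)
        return None
    except ET.ParseError as err:
        return str(err)

# A document that contains only a prolog, with no root element.
prolog_only = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
# The same prolog followed by a single (empty) root element.
with_root = prolog_only + b"<document/>"

print(parse_error(prolog_only))
print(parse_error(with_root))
```

The first call reports an error because the parser reaches the end of the input without finding an element; the second returns None.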
Structuring Elements Correctly
HTML browsers are pretty easygoing about how you structure HTML elements in a Web page as long as they can understand what you're doing. For example, you can often omit closing tags in elements (you might use a <p> tag and then follow it with another <p> tag, without using a </p> tag) and the browser will have no problem.
That's not the way things work in XML. In XML, every non-empty element must have both a start tag and an end tag, as in our example document:
<employee>
    <name>
        <lastname>Gable</lastname>
        <firstname>Clark</firstname>
    </name>
    <hiredate>October 25, 2005</hiredate>
    <projects>
        <project>
            <product>Keyboard</product>
            <id>555</id>
            <price>$129.00</price>
        </project>
        <project>
            <product>Mouse</product>
            <id>666</id>
            <price>$25.00</price>
        </project>
    </projects>
</employee>
Besides making sure that every non-empty element has an opening tag and a closing tag, another well-formedness constraint says that end tags must match start tags, and both must use the same name.
Some elements, called empty elements, don't have closing tags. These elements have no content of any kind (although they can have attributes), which means that they do not enclose any character data or markup. Instead, these elements are made up entirely of one tag like this:
<?xml version = "1.0" standalone="yes"?>
<document>
<heading text = "Hello From XML"/>
</document>
In XML, an empty-element tag like this one must always end with />.
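Here's a short Python sketch of how these tag rules play out in a parser. It assumes the standard xml.etree.ElementTree module, and the helper is_well_formed and the sample strings are hypothetical. The sketch tries an HTML-style document with a missing </p> tag, a document whose end tag differs from its start tag only in case, and the empty-element example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup: the first <p> element is never closed.
html_style = "<document><p>One<p>Two</document>"
# The end tag name must match the start tag name exactly, including case.
mismatched = "<document><p>One</P></document>"
# An empty element written as a single tag ending in />.
empty_tag = '<document><heading text = "Hello From XML"/></document>'

print(is_well_formed(html_style))
print(is_well_formed(mismatched))
print(is_well_formed(empty_tag))

# The empty element has an attribute but no text and no child elements.
heading = ET.fromstring(empty_tag).find("heading")
print(heading.get("text"), heading.text, len(heading))
```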
TIP
HTML elements can also be ended with />, such as <BR/>, and HTML browsers will not have a problem with them. That's good, because the alternative is to write <BR></BR>, which some browsers, such as Netscape Navigator, interpret as two <BR> elements.
Using the Root Element to Contain All Other Elements
Another well-formedness constraint is that the root element must contain all the other elements in the document, as in our sample XML document, where we have three <employee> elements, which themselves contain other elements, in the document element:
<?xml version = "1.0" encoding="UTF-8" standalone="yes"?>
<document>
<employee>
.
.
.
</employee>
<employee>
.
.
.
</employee>
<employee>
.
.
.
</employee>
</document>
That's how a well-formed XML document works: you start with a prolog, followed by the root element, which contains all the other elements, if there are any. Among other things, containing all elements in a root element makes it easier for an XML processor to understand the structure of an XML document; starting at the single root element, it can navigate the entire document.
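As a rough illustration of the single-root rule, the following Python sketch (assuming the standard xml.etree.ElementTree module; the helper and the sample strings are hypothetical) compares two sibling elements with no enclosing root against the same elements wrapped in a <document> element, and then walks the tree starting at the root.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Two top-level elements: nothing encloses them, so there is no single root.
two_roots = "<employee/><employee/>"
# The same elements contained in one root element.
one_root = "<document><employee/><employee/></document>"

print(is_well_formed(two_roots))
print(is_well_formed(one_root))

# Starting at the root, a processor can reach every element in the document.
root = ET.fromstring(one_root)
names = [element.tag for element in root.iter()]
print(names)
```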
Nesting Elements Properly
Nesting elements correctly is a big part of well-formedness; the requirement here is that if an element contains the start tag of a non-empty element, it must also contain that element's end tag. In other words, you cannot spread an element over other elements at the same level. For example, this XML is nested properly:
<employee>
    <name>
        <lastname>Kelly</lastname>
        <firstname>Grace</firstname>
    </name>
    <hiredate>October 15, 2005</hiredate>
    <projects>
        <project>
            <product>Printer</product>
            <id>111</id>
            <price>$111.00</price>
        </project>
        <project>
            <product>Laptop</product>
            <id>222</id>
            <price>$989.00</price>
        </project>
    </projects>
</employee>
But as you can see, there's a nesting problem in this next element, because an XML processor will encounter a new <project> tag before finding the closing </project> tag it's looking for at the end of the current <project> element:
<employee>
    <name>
        <lastname>Kelly</lastname>
        <firstname>Grace</firstname>
    </name>
    <hiredate>October 15, 2005</hiredate>
    <projects>
        <project>
            <product>Printer</product>
            <id>111</id>
            <price>$111.00</price>
        <project>
        </project>
            <product>Laptop</product>
            <id>222</id>
            <price>$989.00</price>
        </project>
    </projects>
</employee>
In fact, this nesting requirement is where the whole term well-formed comes from: the original idea was that a document where the elements were not garbled and mixed up with each other was well-formed.
There are other well-formedness constraints that have nothing to do with elements, however; for example, the next two concern attributes.
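To see the nesting rule enforced, here's a small Python sketch that assumes the standard xml.etree.ElementTree module. The overlapping string is a hypothetical example written for this sketch, not one of the listings above: its <firstname> element starts inside <lastname> but ends outside it.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Each element ends before its parent does.
nested = "<name><lastname>Kelly</lastname><firstname>Grace</firstname></name>"
# <firstname> opens inside <lastname> but closes after it, so the two overlap.
overlapping = "<name><lastname>Kelly<firstname></lastname>Grace</firstname></name>"

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```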
Making Attribute Names Unique
Another well-formedness constraint is that you can't use the same attribute more than once in one start-tag or empty-element tag. This is another well-formedness constraint that seems more or less obvious, and it's hard to see how you might violate this one except by mistake, as in this case:
<message text="Hi there!" text="Hello!">
XML is case sensitive, so you could theoretically do something like this:
<message Text="Hi there!" text="Hello!">
Obviously, that's not a very good idea, however; attribute names that differ only in capitalization are bound to be confusing.
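The two <message> cases can be checked with a few lines of Python. This is a sketch that assumes the standard xml.etree.ElementTree module; the helper name is hypothetical, and the tags are written as empty-element tags so that each string is a complete document.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The same attribute name appears twice in one tag.
duplicate = '<message text="Hi there!" text="Hello!"/>'
# Text and text are different names because XML is case sensitive.
case_differs = '<message Text="Hi there!" text="Hello!"/>'

print(is_well_formed(duplicate))
print(is_well_formed(case_differs))

# Both attributes survive parsing as separate entries.
attributes = ET.fromstring(case_differs).attrib
print(attributes)
```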
Enclose Attribute Values in Quotation Marks
One well-formedness constraint that trips up most XML novices sooner or later is that you must quote every value you assign to an attribute, using either single quotation marks or double quotation marks. This trips many people up because you don't have to quote attribute values in HTML, as in this HTML example (which also doesn't have a closing tag):
<img src=mountains.jpg>
An XML processor would have problems with this element, however. Here's what it would look like properly constructed:
<img src="mountains.jpg" />
If you prefer, you could use single quotation marks:
<img src='mountains.jpg' />
As you've seen, using single quotation marks helps when an attribute's value contains quoted text:
<message text='I said, "No, no, no!"' />
And as you've also seen, in worst-case scenarios, where an attribute value contains both single and double quotation marks, you can escape " as &quot; and ' as &apos;, as here, where you're reporting the height of a tree as 50' 6":
<tree type="Maple" height="50&apos;6&quot;" />
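If you'd like to confirm how a parser handles these quoting rules, the following Python sketch may help. It assumes the standard xml.etree.ElementTree module, and the helper name is hypothetical. It tries the unquoted HTML-style tag, both quoting styles, and the escaped tree height.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

unquoted = "<img src=mountains.jpg />"
double_quoted = '<img src="mountains.jpg" />'
single_quoted = "<img src='mountains.jpg' />"

print(is_well_formed(unquoted))
print(is_well_formed(double_quoted))
print(is_well_formed(single_quoted))

# &apos; and &quot; are turned back into ' and " when the value is parsed.
tree = ET.fromstring('<tree type="Maple" height="50&apos;6&quot;" />')
height = tree.get("height")
print(height)
```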
Avoiding Entity References and < in Attribute Values
Also, W3C makes it an explicit well-formedness constraint that you should avoid references to external entities (this means XML-style references, that is, general entity references or parameter entity references, not just, for example, using an image file's name) in attribute values. This means that an XML processor doesn't have to replace an attribute value with the contents of an external entity.
In addition, another constraint says that you are not supposed to use < in attribute values, because an XML processor might mistake it for markup. If you really have to use the text <, use &lt; instead, which will be turned into < when parsed. For example, this XML:
<project note="This is a <project> element.">
should be written as this, where you're escaping both < and >:
<project note="This is a &lt;project&gt; element.">
In fact, < is a particularly sensitive character to use anywhere in an XML document, except as markup, and that's another well-formedness constraint concerning <, coming up next.
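Rather than escaping attribute text by hand, you can let a library do it. The following Python sketch assumes the standard library's xml.sax.saxutils.escape function and the xml.etree.ElementTree module; the helper name is hypothetical, and the tags are written as empty-element tags so that each string is a complete document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

note = "This is a <project> element."

# Placing the raw text in the attribute value puts a literal < there.
raw = '<project note="%s"/>' % note
# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
# It does not escape quotation marks, which is fine here because note has none.
escaped = '<project note="%s"/>' % escape(note)

print(is_well_formed(raw))
print(escaped)

# The parser turns the escapes back into the original characters.
parsed_note = ET.fromstring(escaped).get("note")
print(parsed_note)
```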
Avoiding Overuse of < and &
XML processors assume that < starts a tag and & starts an entity reference, so you should avoid using those characters for anything else. Sometimes, this is a problem, as in the JavaScript example you saw yesterday, which uses the JavaScript < operator enclosed in a CDATA section:
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>
            Checking the temperature
        </title>
    </head>
    <body>
        <script language="javascript">
        <![CDATA[
        var temperature
        temperature = 234.77
        if (temperature < 32) {
            document.writeln("Below freezing!")
        }
        ]]>
        </script>
        <center>
            <h1>
                Checking the temperature
            </h1>
        </center>
    </body>
</html>
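From an XML processor's point of view, the CDATA section does its job. The following Python sketch assumes the standard xml.etree.ElementTree module and uses a cut-down, hypothetical <script> element rather than the full XHTML page. It shows that the parser accepts the < operator inside the CDATA section and hands it back as ordinary text, while the same script without the CDATA markers is not well-formed.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

with_cdata = """<script>
<![CDATA[
if (temperature < 32) {
    document.writeln("Below freezing!")
}
]]>
</script>"""

# The same script with the CDATA markers stripped out.
without_cdata = with_cdata.replace("<![CDATA[", "").replace("]]>", "")

print(is_well_formed(with_cdata))
print(is_well_formed(without_cdata))

# The parser delivers the CDATA contents as plain character data.
script_text = ET.fromstring(with_cdata).text
print(script_text)
```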
However, because modern Web browsers don't understand CDATA sections, this solution (which was suggested by W3C) doesn't really work. And if you escape the < operator as &lt;, very few browsers will understand what you're doing.
There are two main ways of handling the < JavaScript operator in XML with today's browsers. You can reverse the logical sense of the test; for example, in this case, instead of checking whether the temperature is below 32, you would check to make sure it isn't above or equal to 32, which lets you use > instead of < (note that the JavaScript ! operator, the Not operator, reverses the logical sense of an expression):
<script language="javascript">
var temperature
temperature = 234.77
if (!(temperature >= 32)) {
    document.writeln("Below freezing!")
}
</script>
Practically speaking, the best way is usually to remove the whole problem by placing the script code in an external file, which you'll name script.js here, so the browser won't parse it as XML in the first place. You can do that like this in JavaScript (more on JavaScript and how to use it in XML is coming up in Day 15, "Using JavaScript and XML"):
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/tr/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>
            Checking the temperature
        </title>
    </head>
    <body>
        <script language="javascript" src="script.js">
        </script>
        <center>
            <h1>
                Checking the temperature
            </h1>
        </center>
    </body>
</html>
That completes today's discussion of well-formedness, although you'll see more in the next two days as we discuss the well-formedness constraints that have to do with DTDs.
As your XML documents evolve and become more complex, it's also going to be increasingly important to understand namespaces, which are the second major topic for today.