Markup, Character Data, and Parsing
An XML document contains text characters that fall into two categories: they are either part of the document markup or part of the data content, usually called character data, which simply means all text that is not part of the markup. In other words, XML text consists of intermingled character data and markup. Let's revisit an earlier fragment.
<Address>
  <Street>123 Milky Way</Street>
  <City>Columbia</City>
  <State>MD</State>
  <Zip>20777</Zip>
</Address>
The character data comprises the four strings "123 Milky Way", "Columbia", "MD", and "20777"; the markup comprises the start and end tags for the five elements Address, Street, City, State, and Zip. Note that this is similar, but not identical, to what we previously called content. For example, although each chunk of character data is the content of a particular element, the content of the Address element consists of its four child elements. We can think of each chunk of character data as belonging directly to the element that contains it and indirectly to Address. (In fact, in some XML applications such as XSLT, if we ask for the text content of Address, we'll get the concatenation of all the individual strings.)
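The distinction between direct and indirect character data can be sketched in Python. This is only an illustration using the standard library's xml.etree.ElementTree, not the XSLT behavior itself; the helper names direct_strings and indirect_text are invented for this example.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<Address>
  <Street>123 Milky Way</Street>
  <City>Columbia</City>
  <State>MD</State>
  <Zip>20777</Zip>
</Address>"""


def direct_strings(parent):
    """Return the character data held directly by each child element."""
    return [child.text for child in parent]


def indirect_text(element):
    """Concatenate all character data below an element.

    The whitespace between the child elements is character data too,
    so it is collapsed here to make the result easier to read.
    """
    return " ".join("".join(element.itertext()).split())


address = ET.fromstring(FRAGMENT)
print(direct_strings(address))
print(indirect_text(address))
```

Each string belongs directly to one child element, while the concatenated form is what we get when we ask for the text of Address as a whole.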
The markup itself can be divided into a number of categories, as per section 2.4 of the XML 1.0 specification:
start tags and end tags (e.g., <Address> and </Address> )
empty-element tags (e.g., <Divider/> )
entity references (e.g., &footer; or %otherDTD; )
character references (e.g., &#60; or &#62; )
comments (e.g., <!-- whatever --> )
CDATA section delimiters (e.g., <![CDATA[ insert code here ]]> )
document type declarations (e.g., <!DOCTYPE ....> )
processing instructions (e.g., <?myJavaApp numEmployees="25" location="Columbia" .... ?> )
XML declarations (e.g., <?xml version=.... ?> )
text declarations (e.g., <?xml encoding=.... ?> )
any white space at the top level (before or after the root element)
We will discuss each of these markup aspects in either this chapter or the next. Note that for all types of markup, there are some delimiters, most but not all of which involve angle brackets.
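To see several of these markup categories reported separately from character data, here is a minimal sketch using Python's standard xml.parsers.expat module. The sample document and the classify function are assumptions made up for this illustration. Note that this parser reports an empty-element tag as a start tag followed immediately by an end tag.

```python
import xml.parsers.expat

SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE Address>
<!-- whatever -->
<?myJavaApp numEmployees="25"?>
<Address><Street>123 Milky Way</Street><Divider/></Address>"""


def classify(document):
    """Return (category, detail) pairs in the order the parser reports them."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each run of text in one piece

    def on_text(data):
        # Ignore whitespace-only runs to keep the listing short.
        if data.strip():
            events.append(("character data", data.strip()))

    parser.XmlDeclHandler = lambda version, encoding, standalone: events.append(
        ("XML declaration", version))
    parser.StartDoctypeDeclHandler = lambda name, sysid, pubid, internal: events.append(
        ("document type declaration", name))
    parser.CommentHandler = lambda data: events.append(("comment", data.strip()))
    parser.ProcessingInstructionHandler = lambda target, data: events.append(
        ("processing instruction", target))
    parser.StartElementHandler = lambda name, attrs: events.append(("start tag", name))
    parser.EndElementHandler = lambda name: events.append(("end tag", name))
    parser.CharacterDataHandler = on_text

    parser.Parse(document, True)
    return events


for category, detail in classify(SAMPLE):
    print(f"{category}: {detail}")
```

Only one event in the listing is character data; everything else is markup of one kind or another.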
The specification states that all text that is not markup constitutes the character data of the document. In other words, if you stripped all markup from the document, the remaining content would be the character data. Consider this example:
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE Message SYSTEM "message.dtd">
<Message mime-type="text/plain">
  <!-- This is a trivial example. -->
  <From>The Kenster</From>
  <To>Silly Little Cowgirl</To>
  <Body>
    Hi, there. How is your gardening going?
  </Body>
</Message>
With the markup removed, the character data would be:
The Kenster Silly Little Cowgirl Hi, there. How is your gardening going?
In general, this is the text between the start and end tags, which we previously called the content of the element, but there is a subtlety related to parsing. Depending on parser details, the newlines after </From> and </To> might be replaced by single spaces, as shown. Alternatively, the newlines might be preserved.
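The effect of stripping the markup can be tried out with a short Python sketch. It assumes the standard library's xml.etree.ElementTree, which, as far as I know, keeps the newlines and does not fetch the external message.dtd; other parsers may behave differently. The character_data function and its normalize flag are names invented for this illustration.

```python
import xml.etree.ElementTree as ET

MESSAGE = """<?xml version="1.0" standalone="no" ?>
<!DOCTYPE Message SYSTEM "message.dtd">
<Message mime-type="text/plain">
  <!-- This is a trivial example. -->
  <From>The Kenster</From>
  <To>Silly Little Cowgirl</To>
  <Body>
    Hi, there. How is your gardening going?
  </Body>
</Message>"""


def character_data(document, normalize=False):
    """Return the text left once all markup is stripped.

    With normalize=True, each run of white space (including newlines)
    is collapsed to a single space.
    """
    root = ET.fromstring(document)
    text = "".join(root.itertext())
    return " ".join(text.split()) if normalize else text


print(repr(character_data(MESSAGE)))          # newlines preserved
print(character_data(MESSAGE, normalize=True))
```

The first call shows the character data with its original line breaks; the second shows the single-space form used in the listing above.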
Parsing is the process of splitting up a stream of information into its constituent pieces (often called tokens). In the context of XML, parsing refers to scanning an XML document (which need not be a physical file; it can be a data stream) in order to split it into its markup and character data, and more specifically, into elements and their attributes. XML parsing reveals the structure of the information, since the nesting of elements implies a hierarchy. A document that does not follow the well-formedness rules described in the XML 1.0 Recommendation will fail to parse. A successfully parsed XML document may be either well-formed (at a minimum) or valid, as discussed in detail later in this chapter and the next.
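As a rough illustration of both points, the following Python sketch uses the standard library's xml.etree.ElementTree to print the hierarchy implied by element nesting and to detect a document that is not well-formed. The outline and is_well_formed helpers are hypothetical names chosen for this example, and the check covers well-formedness only, not validity.

```python
import xml.etree.ElementTree as ET


def outline(element, depth=0):
    """Return one indented line per element, mirroring the nesting."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


good = "<Address><Street>123 Milky Way</Street><City>Columbia</City></Address>"
bad = "<Address><Street>123 Milky Way</Address></Street>"  # tags overlap

print("\n".join(outline(ET.fromstring(good))))
print(is_well_formed(good), is_well_formed(bad))
```

The overlapping tags in the second string break the nesting rule, so the parser rejects the document instead of guessing at a structure.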
There is a subtlety about processing character data. Entity references are markup, but during parsing each one is replaced by the character data it stands for. A typical example from XHTML would be:
<p>&quot;AT&amp;T is a winning company,&quot; he said.</p>
After the parser substitutes for the entities, the resultant character data is:
"AT&T is a winning company," he said.
Once the parser has made these substitutions, the resulting text is parsed character data, which is referred to as #PCDATA in DTDs and always refers to the textual content of elements. Character data that is not parsed is called CDATA in DTDs; this relates exclusively to attribute values.
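The entity substitution described above can be observed directly. In this minimal Python sketch, which assumes the standard library's xml.etree.ElementTree, the source text uses the predefined entities &quot; and &amp;, and the parser hands back the substituted character data. The parsed_text helper is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Source text as it would appear in the document, with entity references.
SOURCE = "<p>&quot;AT&amp;T is a winning company,&quot; he said.</p>"

# The same sentence with a bare ampersand, which is not well-formed.
UNESCAPED = '<p>"AT&T is a winning company," he said.</p>'


def parsed_text(document):
    """Return the character data of the root element after parsing."""
    return ET.fromstring(document).text


print(parsed_text(SOURCE))

try:
    parsed_text(UNESCAPED)
except ET.ParseError as error:
    print("not well-formed:", error)
```

The bare ampersand in the second string is rejected because the parser expects every & to begin a reference.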