XML Syntax and Parsing Concepts
- Elements, Tags, Attributes, and Content
- XML Document Structure
- Markup, Character Data, and Parsing
- XML Syntax Rules
- Well-Formed vs. Valid Documents
- Event-Based vs. Tree-Based Parsing
- Summary
- For Further Exploration
In this chapter, we cover the rules of XML syntax that are stated or implied in the XML 1.0 Recommendation from the W3C. A considerable amount of XML terminology is introduced, including discussions of parsing, well-formedness, and validation. XML document structure, legal XML Names, and CDATA are also among the topics. The XML 1.0 specification also discusses rules for Document Type Definitions (DTDs), which we present in chapter 4. The material in chapters 3 and 4 is closely interrelated.
Elements, Tags, Attributes, and Content
To understand XML syntax, we must first be familiar with several basic terms borrowed from HTML (and SGML). XML syntax, however, differs in some important ways from both HTML and SGML, as we'll see.
Elements are the essence of document structure. Each element represents a piece of information. It may have attributes, and it may contain nested elements that represent more specific information, textual content, or both. In our employee directory example from chapter 2 (Listing 2-2), some of the elements were Employees, Employee, Name, First, Last, Project, and PhoneNumbers.
Tags are the way elements are indicated or marked up in a document. For each element,1 there is typically a start tag that begins with < (less than) and ends with > (greater than), and an end tag that begins with </ and ends with >. Some of the start tags in our example were <Employees>, <Employee>, <Name>, and so forth. The corresponding end tags for these elements were </Employees>, </Employee>, and </Name>.
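To make the distinction between elements and tags concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module to parse a small document and list the element names it finds. The sample document and the names in it are hypothetical, loosely modeled on the employee directory rather than copied from Listing 2-2.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the employee directory; not Listing 2-2 itself.
SAMPLE = """<Employees>
  <Employee>
    <Name><First>Ada</First><Last>Smith</Last></Name>
    <Project>Orion</Project>
  </Employee>
</Employees>"""

def element_names(xml_text):
    """Return element names in document order (the order of their start tags)."""
    root = ET.fromstring(xml_text)
    return [elem.tag for elem in root.iter()]

names = element_names(SAMPLE)
print(names)
```

Each name in the resulting list corresponds to one start tag and its matching end tag in the text; the parser reports elements, not the individual tags that delimit them.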
If an element has one or more attributes, they must appear between the < and > delimiters of the start tag. Attributes are qualifying pieces of information that add detail and further define an instance of an element. They are typically details that the language designer feels do not need to be nested elements themselves; the assumption is that the attributes will generally be accessed less often than the elements that contain them, but this tends to be application dependent.2 In our employee example, the only element that had an attribute was Employee, and the attribute was sex, with two kinds of instances:
<Employee sex="male">
or
<Employee sex="female">.
Each attribute has a value, the quoted text to the right of the equal sign. In the preceding examples, the values of the two instances of the sex attribute are "male" and "female". Although in this case the value is a single word, values can be any amount of text, enclosed in single or double quotes. HTML permits attributes that do not require values (e.g., the selected attribute to denote a default choice in a form, as in <OPTION selected>), but this so-called attribute minimization is expressly not permitted in XML.
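The rules for attribute values can be checked with a short experiment. The sketch below, again assuming Python's standard xml.etree.ElementTree module, reads the sex attribute with either quoting style and then shows that an HTML-style minimized attribute is rejected by the parser. The helper is_well_formed and the option text "Maryland" are illustrative names introduced here, not part of any API or of the original example.

```python
import xml.etree.ElementTree as ET

# Single or double quotes are both accepted around an attribute value.
double_quoted = ET.fromstring('<Employee sex="female"></Employee>')
single_quoted = ET.fromstring("<Employee sex='male'></Employee>")

print(double_quoted.get("sex"))  # one attribute value, as a string
print(single_quoted.attrib)      # all attributes of the element, as a dict

def is_well_formed(xml_text):
    """Hypothetical helper: True if the parser accepts the text, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# HTML-style attribute minimization (an attribute with no value) is a syntax error.
minimized_ok = is_well_formed("<OPTION selected>Maryland</OPTION>")
# Giving the attribute an explicit value makes the same element acceptable.
explicit_ok = is_well_formed('<OPTION selected="selected">Maryland</OPTION>')
print(minimized_ok, explicit_ok)
```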
Content is whatever an element contains. Sometimes element content is simply text. In other cases, elements contain nested elements; the inner (child) elements are called the content of the outer (parent) element. For example, in this fragment:
<Address>
  <Street>123 Milky Way</Street>
  <City>Columbia</City>
  <State>MD</State>
  <Zip>20777</Zip>
</Address>
"123 Milky Way" is the text content of the Street element, "Columbia" is the text content of the City element, and Street, City, State, and Zip are all nested element content of the parent Address element. Stripped of its markup, the content of Address reads "123 Milky Way Columbia MD 20777". (The spaces preceding the last three values are due to the newlines that separate the child elements, as we'll see.)
Notice that the content of Zip is the text string "20777". Why do we not say that this is a number or, better yet, an instance of some zip code datatype (constrained to the valid five-digit ddddd or five-plus-four-digit ddddd-dddd forms)? Because nothing about the Zip element conveys that its content is numeric! We could, however, denote the element's datatype explicitly by means of an attribute:
<Zip type="integer">20777</Zip>
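A brief sketch shows what this means in practice: a parser returns the content of Zip as a string, and any interpretation of the type attribute is left to the application. The code assumes Python's standard xml.etree.ElementTree module; the typed_value and looks_like_zip helpers and the regular expression are hypothetical conventions introduced for illustration, not something XML 1.0 defines.

```python
import re
import xml.etree.ElementTree as ET

zip_elem = ET.fromstring('<Zip type="integer">20777</Zip>')

# The parser hands back character data; it does not interpret it.
content = zip_elem.text
print(type(content).__name__, repr(content))

def typed_value(elem):
    """Hypothetical convention: convert the text if type="integer" is present."""
    if elem.get("type") == "integer":
        return int(elem.text)
    return elem.text

# Assumed pattern for the five-digit and five-plus-four zip code forms.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")

def looks_like_zip(text):
    """Return True if the text matches ddddd or ddddd-dddd."""
    return ZIP_PATTERN.fullmatch(text) is not None

print(typed_value(zip_elem))
print(looks_like_zip(content), looks_like_zip("20777-1234"), looks_like_zip("2077"))
```

Note that treating a zip code as an integer discards leading zeros (int("02139") is 2139), which is one reason a dedicated datatype is preferable to a generic numeric one.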
We'll eventually see how an alternative to DTDs called XML Schema makes data typing easier and far more flexible.
Another possibility, called mixed content, was illustrated in chapter 2 in the section "Document-Centric vs. Data-Centric," in which both text and element content may appear as the content of a parent element. We'll see how to handle this in chapter 4.
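As a small preview of what mixed content looks like to a program, the sketch below parses a hypothetical Para element in which text and a child element are interleaved. It assumes Python's standard xml.etree.ElementTree module, which exposes the text before a child through the parent's text attribute and the text after a child through that child's tail attribute; other parsers model this differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content element: text interleaved with a child element.
para = ET.fromstring("<Para>Call <Name>Ada</Name> before noon.</Para>")

leading_text = para.text      # text before the first child element
child = para[0]               # the nested Name element
trailing_text = child.tail    # text that follows the child's end tag

# Concatenating all character data in document order recovers the sentence.
full_text = "".join(para.itertext())

print(repr(leading_text), child.tag, repr(trailing_text))
print(full_text)
```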