The Query Data Model
The first step in designing XQuery was to specify the data model on which the language operates. The Query data model [XQ-DM] represents XML data in the form of nodes and values, which serve as the operands and results of the XQuery operators. XQuery is closed under the Query data model, which means that the result of any valid XQuery expression can be represented in this model. Since all the operators and expressions of XQuery are defined in terms of the Query data model, understanding this model is the key to understanding the language.
In defining the Query data model, the working group did not intend to deviate from existing standards, but to conform to them wherever possible. Therefore, the Query data model draws from several previously existing specifications. The information that results from parsing an XML document is specified by the XML Information Set [INFOSET], in the form of a collection of information items. The XML Information Set (or Infoset) contains no type information, and it represents data at a very primitive level; for example, every character has its own information item. XML Schema specifies an augmented form of the XML Infoset called the Post-Schema Validation Infoset, or PSVI. In the PSVI, information items that represent elements and attributes have type information and normalized values that are derived by a process called schema validation. The PSVI contains all the information about an XML document that is needed for processing a query, and the Query data model is based on the information contained in the PSVI.
For reasons described in the next section, the working group decided early to include the existing language XPath [XPATH1] as a subset of XQuery. XPath provides a notation for selecting information within an existing XML document, but it does not provide a way to construct new XML elements. Section 5 of the XPath specification shows how to represent the information in the XML Infoset in terms of a tree structure containing seven kinds of nodes. The operators of XPath are defined in terms of these seven kinds of nodes. In order to retain the original XPath operators and still take advantage of the richer type system of XML Schema, the XQuery designers decided to augment the XPath data model with the additional type information contained in the PSVI. The result of this process is the Query data model. The Query data model can be thought of as representing the PSVI in the form of a node hierarchy, much as the XPath data model represents the XML Infoset in the form of a node hierarchy.
In the Query data model, every value is an ordered sequence of zero or more items. An item can be either an atomic value or a node. An atomic value has a type, which is one of the atomic types defined by XML Schema or is derived from one of these types by restriction. A node is one of the seven kinds of node defined by XPath, called document, element, attribute, text, comment, processing instruction, and namespace nodes. Nodes have identity, and an ordering called document order is defined among all the nodes that are in scope.
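The definitions in the preceding paragraph can be pictured with a small Python sketch. It is only an illustration of the vocabulary, not part of [XQ-DM]: the class names AtomicValue and Node, the type_name field, and the use of a Python tuple to stand for a sequence are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# The seven kinds of node listed in the text.
NODE_KINDS = (
    "document", "element", "attribute", "text",
    "comment", "processing-instruction", "namespace",
)


@dataclass(frozen=True)
class AtomicValue:
    """An atomic value together with the name of its atomic type (hypothetical class)."""
    type_name: str   # for example "xs:integer"
    value: object


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by content
class Node:
    """A node of one of the seven kinds (hypothetical class)."""
    kind: str
    name: str = ""

    def __post_init__(self):
        if self.kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {self.kind}")


# An item is either an atomic value or a node; a value is an ordered
# sequence of zero or more items, modeled here as a tuple.
Item = Union[AtomicValue, Node]
Sequence = Tuple[Item, ...]

empty: Sequence = ()                       # a sequence of zero items
step_a = Node("element", "step")
step_b = Node("element", "step")           # same kind and name, different identity
mixed: Sequence = (AtomicValue("xs:integer", 15), step_a, step_b)
```

In this sketch two atomic values with the same type and content compare equal, whereas step_a and step_b are distinct items even though they look alike, which mirrors the statement that nodes have identity.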
An instance of the Query data model may contain one or more XML documents or fragments of documents, each represented by its own tree of nodes. The root node of the tree that represents an XML document is a document node. Each element in the document is represented by an element node, which may be connected to attributes (represented by attribute nodes) and content (represented by text nodes and nested element nodes). The primitive data in the document is represented by text nodes, which form the leaves of the node tree.
Figure 2.1 illustrates the Query data model representation of a simple XML document. Nodes are represented by circles labeled D for document nodes, E for element nodes, A for attribute nodes, and T for text nodes. The XML document represented by Figure 2.1 is shown in Listing 2.1:
Listing 2.1 XML Document Represented by Figure 2.1
<?xml version="1.0" ?>
<procedure title="Removing a light bulb">
  <time unit="sec">15</time>
  <step>Grip bulb.</step>
  <step> Rotate it <warning>slowly</warning> counterclockwise. </step>
</procedure>
Figure 2.1 Example of the Query Data Model
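As a rough way to see the node hierarchy of Listing 2.1, the following Python sketch parses the document with the standard xml.dom.minidom module and prints one line per node, using the same D, E, A, and T labels as the figure. The DOM is not the Query data model (it carries no type annotations, for instance), so this is only an approximation of the tree shape. The function name outline is hypothetical, and skipping whitespace-only text nodes is an assumption made here to keep the output small.

```python
from xml.dom import minidom, Node as DomNode

LISTING_2_1 = """<?xml version="1.0" ?>
<procedure title="Removing a light bulb">
  <time unit="sec">15</time>
  <step>Grip bulb.</step>
  <step> Rotate it <warning>slowly</warning> counterclockwise. </step>
</procedure>"""


def outline(node, depth=0, out=None):
    """Return a list of indented lines, one per document, element, attribute, or text node."""
    if out is None:
        out = []
    indent = "  " * depth
    if node.nodeType == DomNode.DOCUMENT_NODE:
        out.append(indent + "D")
    elif node.nodeType == DomNode.ELEMENT_NODE:
        out.append(f"{indent}E {node.tagName}")
        # Attributes are connected to their element but are not among its children.
        for i in range(node.attributes.length):
            attr = node.attributes.item(i)
            out.append(f"{indent}  A {attr.name}={attr.value!r}")
    elif node.nodeType == DomNode.TEXT_NODE:
        text = node.data.strip()
        if not text:
            return out  # assumption: ignore whitespace used only for layout
        out.append(f"{indent}T {text!r}")
    else:
        return out  # other node kinds do not occur in this listing
    for child in node.childNodes:
        outline(child, depth + 1, out)
    return out


if __name__ == "__main__":
    print("\n".join(outline(minidom.parseString(LISTING_2_1))))
```

The text nodes appear only at the leaves of the printed outline, and the second step element has mixed content: two text nodes with a nested warning element between them.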
In the Query data model, each element or attribute node has a name, a string value, a type annotation, and a typed value. These properties are not independent. The type annotation of an element represents its type as determined by the schema validation process. An element that has not been validated, or for which no more specific type is known, has the type annotation xs:anyType, where xs: is a prefix representing the namespace of XML Schema. If an element has no descendant elements, then its typed value can be derived from its string value and its type annotation.
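The dependence of the typed value on the string value and the type annotation can be sketched in Python as follows. The function typed_value and the table CONSTRUCTORS are hypothetical, only a handful of XML Schema types are covered, and the fallback of returning the string unchanged for xs:anyType or any unrecognized type is a simplification chosen for this example rather than a rule taken from [XQ-DM].

```python
from decimal import Decimal


def _parse_boolean(text):
    """Map the lexical forms of xs:boolean to a Python bool."""
    forms = {"true": True, "1": True, "false": False, "0": False}
    if text not in forms:
        raise ValueError(f"not a valid xs:boolean: {text!r}")
    return forms[text]


# Hypothetical table: type annotation -> function that parses a string value.
CONSTRUCTORS = {
    "xs:integer": int,
    "xs:decimal": Decimal,
    "xs:boolean": _parse_boolean,
}


def typed_value(string_value, type_annotation):
    """Derive a typed value from a string value and a type annotation."""
    parse = CONSTRUCTORS.get(type_annotation)
    if parse is None:
        # Assumption of this sketch: with no more specific type known,
        # keep the string value as it is.
        return string_value
    return parse(string_value.strip())


# The time element of Listing 2.1 has the string value "15".
as_integer = typed_value("15", "xs:integer")    # validated as an integer
as_untyped = typed_value("15", "xs:anyType")    # not validated
```

The same string value therefore yields different typed values depending on the annotation: a number that can take part in arithmetic in the first case, and plain character data in the second.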
Bear in mind that the type of an element describes the potential content of the element and does not depend on the name of the element. For example, two elements named cost and price could both have the type annotation decimal because they both require decimal content. Similarly, two elements named shipto and billto could both have the type annotation address, which might be a complex type defined in a schema that describes the potential content of the elements.
XQuery is defined as a transformation from one instance of the Query data model to another instance of the Query data model. This simplifies the definition of XQuery but leaves open the issues of where input data comes from and how output data is delivered to applications. A query gains access to input data by calling an XQuery input function such as document or collection, or by referencing some part of the external context (such as a prebound variable or the "current node"). Each of these input methods is defined to return a Query data model instance in the form of one or more node hierarchies. One way in which a node hierarchy could be created is by parsing an XML document, validating it against a known or default schema, and converting the resulting PSVI into the Query data model as described in [XQ-DM]. Another way is for a system to store XML documents in a pre-validated form so that their Query data model representation can be materialized quickly on demand. A third way is for the Query data model to be synthesized directly from some data source such as a relational database, deriving its type information from "metadata" in the database catalog.
The process of serializing a Query data model instance as a linear XML document remains unspecified at present. All XML documents can be represented using the Query data model, but not all instances of the Query data model are valid XML documents. For example, the result of a query might be a sequence of atomic values, or an attribute that is not attached to any element. Mechanisms for serializing these values and for binding them to variables in a host programming language remain to be specified.