- Introduction
- The Need for an XML Query Language
- Basic Principles
- The Query Data Model
- Related Languages and Standards
- Watershed Issues
- Conclusion
The Need for an XML Query Language
Early in its history, the XML Query Working Group confronted the question of whether XML is sufficiently different from other data formats to require a query language of its own. The SQL language [SQL99] is a very well established standard for retrieving information from relational databases and has recently been enhanced with new facilities called "structured types" that support nested structures similar to the nesting of elements in XML. If SQL could be further extended to meet XML query requirements, developers could leverage their considerable investment in SQL implementations, and users could apply the features of these robust and mature systems to their XML databases without learning a completely new language.
Given these incentives, the working group conducted a study of the differences between XML data and relational data from the point of view of a query language. Some of the significant differences between the two data models are summarized below.
Relational data is "flat" (that is, organized in the form of a two-dimensional array of rows and columns). In contrast, XML data is "nested," and its depth of nesting can be irregular and unpredictable. Relational databases can represent nested data structures by using structured types or tables with foreign keys, but it is difficult to search these structures for objects at an unknown depth of nesting. In XML, on the other hand, it is very natural to search for objects whose position in a document hierarchy is unknown. An example of such a query might be "Find all the red things," represented in the XPath language [XPATH1] by the expression //*[@color = "Red"]. This query would be much more difficult to represent in a relational query language.
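The "Find all the red things" query can be tried out with a minimal sketch in Python, whose standard xml.etree.ElementTree module supports a limited subset of XPath. The sample document and its element names are assumptions made up for illustration, and the path is written in relative form (.//*) because ElementTree evaluates paths relative to an element rather than to the document root.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: red items sit at different nesting depths.
DOC = """
<inventory>
  <fruit color="Red">cherry</fruit>
  <street>
    <corner>
      <sign color="Red">stop</sign>
    </corner>
    <sign color="Yellow">yield</sign>
  </street>
</inventory>
"""

root = ET.fromstring(DOC)

# './/*' visits every descendant of the root, whatever its depth;
# the predicate keeps only elements whose color attribute equals "Red".
red_things = root.findall(".//*[@color='Red']")
red_tags = [elem.tag for elem in red_things]
print(red_tags)
```

The query never states how deeply the matching elements are nested, which is the property that is hard to reproduce over tables linked by foreign keys.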
Relational data is regular and homogeneous. Every row of a table has the same columns, with the same names and types. This allows metadata (information that describes the structure of the data) to be removed from the data itself and stored in a separate catalog. XML data, on the other hand, is irregular and heterogeneous. Each instance of a web page or a book chapter can have a different structure and must therefore describe its own structure. As a result, the ratio of metadata to data is much higher in XML than in a relational database, and in XML the metadata is distributed throughout the data in the form of tags rather than being separated from the data. In XML, it is natural to ask queries that span both data and metadata, such as "What kinds of things in the 2002 inventory have color attributes," represented in XPath by the expression /inventory[@year = "2002"]/*[@color]. In a relational language, such a query would require a join that might span several data tables and system catalog tables.
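A query that spans data and metadata can be sketched the same way. The example below is an approximation of /inventory[@year = "2002"]/*[@color] using Python's xml.etree.ElementTree. The document content is hypothetical, and the test on the root element is written in ordinary Python because ElementTree applies paths relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical inventory: the element names act as metadata ("kinds of things").
DOC = """
<inventory year="2002">
  <flag color="Red"/>
  <hammer weight="2"/>
  <cherry color="Red"/>
  <crayon color="Blue"/>
</inventory>
"""

root = ET.fromstring(DOC)

kinds = []
# Data condition: the inventory must be the one for 2002.
if root.tag == "inventory" and root.get("year") == "2002":
    # './*[@color]' selects child elements that carry a color attribute.
    for child in root.findall("./*[@color]"):
        # Metadata result: report the element name, not its content.
        if child.tag not in kinds:
            kinds.append(child.tag)

print(kinds)
```

The answer is a list of element names, which in a relational system would have to be read from the catalog rather than from the data tables.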
Like a stored table, the result of a relational query is flat, regular, and homogeneous. The result of an XML query, on the other hand, has none of these properties. For example, the result of the query "Find all the red things" may contain a cherry, a flag, and a stop sign, each with a different internal structure. In general, the result of an expression in an XML query may consist of a heterogeneous sequence of elements, attributes, and primitive values, all of mixed type. This set of objects might then serve as an intermediate result used in the processing of a higher-level expression. The heterogeneous nature of XML data conflicts with the SQL assumption that every expression inside a query returns an array of rows and columns. It also requires a query language to provide constructors that are capable of creating complex nested structures on the fly, a facility that is not needed in a relational language.
Because of its regular structure, relational data is "dense" (that is, every row has a value in every column). This gave rise to the need for a "null value" to represent unknown or inapplicable values in relational databases. XML data, on the other hand, may be "sparse." Since all the elements of a given type need not have the same structure, information that is unknown or inapplicable can simply not appear. This gives an XML query language additional degrees of freedom for dealing with missing data. The XQuery approach to representing unknown or inapplicable data is discussed under Issue 2 under "Watershed Issues" below.
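The idea of sparse data can be illustrated with a small sketch, again in Python with xml.etree.ElementTree. The catalog, item, name, and price elements are hypothetical. The point is only that an unknown price is represented by leaving the price element out, not by storing a placeholder value.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: the second item has no price element at all.
DOC = """
<catalog>
  <item><name>flag</name><price>12.50</price></item>
  <item><name>cherry</name></item>
</catalog>
"""

root = ET.fromstring(DOC)

priced = {}    # item name -> price, only for items that state a price
unpriced = []  # names of items whose price simply does not appear

for item in root.findall("item"):
    name = item.findtext("name")
    price = item.find("price")  # find() returns None when the child is absent
    if price is not None:
        priced[name] = float(price.text)
    else:
        unpriced.append(name)

print(priced, unpriced)
```

A query over such data has to decide how an absent element behaves in comparisons and arithmetic, which is the design freedom the paragraph above refers to.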
In a relational database, the rows of a table are not considered to have an ordering other than the orderings that can be derived from their values. XML documents, on the other hand, have an intrinsic order that can be important to their meaning and cannot be derived from data values. This has several implications for the design of a query language. It means that queries must at least provide an option in which the original order of elements is preserved in the query result. It means that facilities are needed to search for objects on the basis of their order, as in "Find the fifth red object" or "Find objects that occur after this one and before that one." It also means that we need facilities to impose an order on sequences of objects, possibly at several levels of a hierarchy. The importance of order in XML contrasts sharply with the absence of intrinsic order in the relational data model.
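The three order-related facilities described above (selecting by position, selecting by relative position, and imposing a new order) can be sketched in Python with xml.etree.ElementTree, whose findall method returns matches in document order. The parade document is a made-up example, and the Python slicing and sorting stand in for whatever operators a query language would provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in which the order of entries carries meaning.
DOC = """
<parade>
  <float color="Red">first</float>
  <band color="Blue">second</band>
  <float color="Red">third</float>
  <car color="Red">fourth</car>
</parade>
"""

root = ET.fromstring(DOC)

# Matches come back in document order, so positions are meaningful.
red = root.findall(".//*[@color='Red']")
second_red = red[1]  # "Find the second red object"

# "Find objects that occur after this one and before that one":
# here, everything between the first and the last red entry.
children = list(root)
start = children.index(red[0])
end = children.index(red[-1])
between = children[start + 1:end]

# Impose an order that is not the document order: sort by text content.
by_text = sorted(root, key=lambda elem: elem.text)

print(second_red.text)
print([elem.tag for elem in between])
print([elem.text for elem in by_text])
```

Note that the sorted result is a new sequence; the original document order remains available through root, so a query can still preserve it in its output.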
The significant data model differences summarized above led the working group to decide that the objectives of XML queries could best be served by designing a new query language rather than by extending a relational language. Designing a query language for XML, however, is not a small task, precisely because of the complexity of XML data. An XML "value," computed by a query expression, may consist of zero, one, or many items, each of which may be an element, an attribute, or a primitive value. Therefore, each operator in an XML query language must be well defined for all these possible inputs. The result is likely to be a language with a more complex semantic definition than that of a relational language such as SQL.