8.3 Processing Performance
One of XML's great strengths is the enormous selection of libraries, applications, toolkits, and frameworks available to developers in nearly every programming language and on nearly every platform. Sometimes, especially in the case of such low-level components as parsers, this software has been heavily optimized for size or performance. In other cases, however, publicly available XML software is optimized instead for ease of use; when that happens, the software can hide design choices that trigger enormous size or performance hits while processing XML.
When a high-level component, such as an XSLT engine, becomes an efficiency bottleneck, developers need to understand how their tools work and where the costs come from. This section describes some of the optimizations available to developers to improve XML performance for parsing, querying, and transforming XML and then introduces the technique of XML pipelines.
8.3.1 Parsing Interfaces
Although people have come up with some innovative ideas for XML parsing interfaces, such as pull parsers and query-based parsers, nearly all XML parsing interfaces in common use fall into one of two groups:
- Tree-based interfaces, such as the Document Object Model [DOM]
- Event-based interfaces, such as the Simple API for XML [SAX]
DOM-like interfaces allow applications to navigate and modify an XML document in much the same way that they navigate and modify a file system, by moving up and down various branches. As a matter of fact, some XML user interfaces even present XML documents using file system iconography, with folder icons for elements and file icons for text.
Unfortunately, the ease of use of DOM-like interfaces comes at a high price: An XML document's entire structure must be available for random access. In practice, that means that the structure must be in a database or, most commonly, in computer memory. During batch processing, building that in-memory data structure requires time for allocating objects [1] and memory for holding them, typically five to ten times the size of the original XML document. For a desktop application, these requirements are not usually a problem: A user editing a 2MB XML document will not mind waiting a few seconds for the document to load and can easily spare 20MB of memory to edit it. However, on a system handling hundreds or thousands of XML documents simultaneously and quickly, such as a server or high-speed information system (or, even worse, a network appliance), the memory and time overhead are entirely unacceptable.
In these cases, switching to an event-based interface, such as SAX, may be a good choice. SAX-like interfaces do not allow random access to an XML document (where a DOM-like interface is similar to a hard drive, a SAX-like interface is more similar to a tape drive), but a SAX-like interface uses near-constant memory no matter how large an XML document may be [2] and tends to use very little processing time, as it does not need to allocate new objects during its run.
Unfortunately, programming with a SAX-like parsing interface is considerably more complicated than programming with a DOM-like interface. The application needs to capture and store any information it requires as the information streams past, and there is no way to follow backward references in a document. When ease of implementation is the priority, a DOM-like interface makes sense; when performance is the priority, a SAX-like interface makes sense.
Often, applications parsing data-oriented XML documents build their own in-memory data trees.
- An accounting application might build a tree of ledgers, accounts, and entries.
- A geodata application might build a tree of polygons and points.
- A graphics program might build a tree of shapes and lines.
An application that builds a specialized data tree using a DOM-like XML parsing interface takes a double hit: First, the parser uses a lot of time and memory to build a parse tree of the XML document, and then the application uses more time and memory to build its own tree before discarding the parser's tree. In such a case, simply switching to an event-based interface can reduce an application's time and memory requirements for XML processing to a fraction of what they originally were.
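For instance, a graphics program can assemble its tree of shapes directly in the SAX callbacks, so that no generic parse tree is ever allocated. In this minimal Java sketch, the line element and its coordinate attributes are invented for illustration, and the parser is assumed to be namespace aware:

import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ShapeTreeBuilder extends DefaultHandler {

    // The application's own data structure: no DOM tree is ever built.
    public record Line(double x1, double y1, double x2, double y2) {}

    private final List<Line> lines = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        if ("line".equals(local)) {
            // Construct the application object directly from the event.
            lines.add(new Line(
                    Double.parseDouble(atts.getValue("x1")),
                    Double.parseDouble(atts.getValue("y1")),
                    Double.parseDouble(atts.getValue("x2")),
                    Double.parseDouble(atts.getValue("y2"))));
        }
    }

    public List<Line> getLines() { return lines; }
}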
Another place where tree-based interfaces often hide performance costs is in transformations, discussed in Section 8.3.3.
8.3.2 Queries
Sometimes, an application reads an XML document simply to extract a small bit of information hidden somewhere inside it. A bibliographical application might need to extract only the title and the author's name from an XML-encoded research paper, like the one in Example 8-3.
Example 8-3. Bibliographical Information in a Research Paper
<article>
  <articleinfo>
    <title>Frog populations in Frontenac County</title>
    <author>
      <honorific>Dr.</honorific>
      <firstname>Jose</firstname>
      <surname>Schmidt</surname>
      <affiliation>
        <orgdiv>Department of Biology</orgdiv>
        <orgname>King's University</orgname>
      </affiliation>
    </author>
  </articleinfo>
  <sect1>
    <title>Introduction</title>
    <para>During the summers of 1999 and 2000, a team of
    researchers [...]</para>
    [...]
  </sect1>
  [...]
</article>
If the paper were 100 pages long, building a DOM tree of the entire XML document simply to extract the title and author information would be extremely wasteful. It is much better to write a short, perhaps 50-line, program in Perl, Java, Python, C, or C++ to extract the required information from an event stream provided by a SAX-like interface. In fact, the application can terminate parsing as soon as it has all the required information, so the parser will not need to read most of the document at all, much less build it into an in-memory tree.
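A minimal Java sketch of such an extractor, assuming the markup in Example 8-3: the handler records the title and surname, then aborts the parse by throwing an exception as soon as the articleinfo element ends, so the rest of the paper is never read.

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class BiblioExtractor extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private boolean inArticleInfo;
    String title, surname;

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        if ("articleinfo".equals(local)) inArticleInfo = true;
        text.setLength(0); // start collecting content for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        if (inArticleInfo && "title".equals(local)) title = text.toString();
        if (inArticleInfo && "surname".equals(local)) surname = text.toString();
        if ("articleinfo".equals(local))
            // Everything needed has streamed past: abort the parse.
            throw new SAXException("done");
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        BiblioExtractor handler = new BiblioExtractor();
        try {
            factory.newSAXParser().parse(args[0], handler);
        } catch (SAXException expected) {
            // deliberate early termination, not an error
        }
        System.out.println(handler.title + " / " + handler.surname);
    }
}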
Unfortunately, custom programming is not always a practical solution: Many people need to extract information from XML documents in a more generic way that does not require hard-core computer programming skills. For situations like these, the XML community has largely settled on XPath [XPATH] as a simple query syntax for XML documents. [3] For example, the following XPath expressions select the element containing the article title and the subtree containing the author's name:
/article/articleinfo/title
/article/articleinfo/author
A general-purpose query application can use these expressions to return matching fragments from an XML document, the same way that the UNIX grep command uses regular expressions to return matching fragments from a text file. An interactive XML editor or browser can use these expressions to bring the cursor to the matching points in a document.
Unfortunately, it is necessary to have random access to the original XML document to support all of XPath, and that, once again, requires a DOM-like interface. In a high-speed environment, in which predictable, consistent performance is important, it may be necessary to restrict queries to the subset of XPath that can be supported by a streaming interface. Both of the previous XPath expressions, for example, could be resolved by an application using a SAX-like parsing interface, as they do not require any backward references.
The XPath 1.0 expressions that can be matched most efficiently, in both time and memory, are the ones that limit themselves mostly to context on the element stack, particularly parent, child, ancestor, and descendant relationships. For example, all the following XPath expressions could be matched against output from an event-based XML parser, such as a SAX parser, storing no context except for a stack of the open elements and their attributes up to the root element and the index of each element in the stack relative to its siblings (a sketch of such a matcher appears after the list below):
//caution
/book//chapter/title[contains(string(), "Crimson")]
//section[@id="preface"]//character[string()="Joseph"]
- The first XPath expression simply requires matching the caution element, with no additional context.
- The second expression requires matching the current element against title and examining the element stack to ensure that the parent element is chapter and that the root element is book, then testing whether the string "Crimson" appears in the element's content.
- The third expression requires matching the current element against character, testing whether a section element with the id attribute equal to "preface" appears anywhere in the element stack, and testing whether the element's content is exactly the string Joseph.
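For illustration, here is a Java sketch of such a stack-based matcher covering the first two expressions; the contains() test on the title's content is omitted, as it would require only buffering the text of the current element.

import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StackMatcher extends DefaultHandler {

    // The only context kept: the stack of currently open elements.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        open.push(local);
        // //caution needs no context beyond the element name itself.
        if ("caution".equals(local))
            System.out.println("match: //caution");
        // /book//chapter/title needs only the open-element stack.
        if ("title".equals(local)) {
            String[] path = open.toArray(new String[0]); // [0] = current
            if (path.length >= 3
                    && "chapter".equals(path[1])              // parent
                    && "book".equals(path[path.length - 1]))  // root
                System.out.println("match: /book//chapter/title");
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        open.pop();
    }
}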
These are all queries that will use near-constant memory no matter how long an XML document is, as none requires random access. The following inefficient expressions, on the other hand, do require random access to a document or else the caching of a large part of a document's contents:
//caution[../para[contains(string(), "oil")]]
/book//chapter[following-sibling::chapter/title[string()="Crimson"]]
//link[id(@ref)/@status="modified"]
- The first XPath expression matches any caution element that has a sibling named para containing the string "oil" in its content. As a result, a query engine will have to save the content of every caution element until the parser has finished reporting all its siblings.
- The second expression matches any chapter that appears before a chapter with the title "Crimson". As a result, the query engine will have to cache the content of all chapters until the parser has finished processing the parent element.
- The third expression is the most inefficient: In the worst case, it will require the query engine to cache every link element until the end of the document.
Fortunately, the efficiency concerns for XPath expressions are less serious than they might appear, because many technical specialists, not to mention ordinary users, will find such expressions anywhere from excessively complicated to baffling, and so will rarely write them. Limiting a high-performance query engine to efficient expressions still provides a respectable amount of query capability while also allowing an application all the advantages of a SAX-like interface (see Section 8.3.1): Modifying an application to use only this subset, together with a SAX-like parser, can bring significant speed and memory improvements to an XML application.
8.3.3 Transformations
Performance problems with transformations are closely related to the problems with parsing interfaces (Section 8.3.1) and queries (Section 8.3.2). The most popular specification for transforming XML documents into XML or another format is XSL Transformations [XSLT], but because XSLT supports the full XPath specification, XSLT also requires random access to the original XML document. Therefore, XSLT engines typically use a DOM-like interface with all the extra time and memory overhead.
XSLT's expressive power, and the fact that it is a template language, make it ideal for complex structural transformations; in those cases, the extra time and memory may be a reasonable tradeoff for rapid development. On the other hand, many transformations are small and do not require the ability to restructure a document extensively. In these cases, the significant overhead of XSLT processors and other tree-based transformation tools is unacceptable.
Before considering alternatives, this section takes a quick look at the efficiency of XSLT engines. XSLT engines read an XML document as input and create a new document, possibly XML, as output. Although the engines require random access to the input document, they have no such requirement for the output, so there is no need to build a second in-memory document tree. Command-line XSLT utilities are generally well-enough designed that they do not build a second DOM tree, or the equivalent, but simply write out the result as a stream. However, if an XML project happens to use an XSLT library, a developer could end up using the library to generate a DOM tree of the output XML before writing anything out. This is, obviously, a bad idea when memory is at a premium, and fixing this problem is an easy way to halve the memory consumption of any project that uses XSLT.
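In JAXP, for example, avoiding the second tree is simply a matter of asking for a stream rather than a DOM as the transformation result. A minimal sketch (the file names are placeholders):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class StreamingOutput {
    public static void main(String[] args) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("style.xsl"));
        // Serialize the result directly as it is generated; asking for
        // a DOMResult here would build a second in-memory tree instead.
        transformer.transform(new StreamSource("input.xml"),
                              new StreamResult(System.out));
    }
}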
A second optimization is possible with XSLT, but to date, no one seems to have taken advantage of it. XSLT requires random access to the input document but does not modify the original document; as a result, it is possible to parallelize XSLT processing, assigning different processors to perform different parts of the transformations on the input document. [4] This approach will not solve the memory problem but could bring significant speed improvements for large input documents or complex transformation rules.
As mentioned earlier, however, XSLT's ability to access the input document randomly using XPath expressions is not necessary for many kinds of transformations, and an application working with an event stream, such as SAX, can run much faster, using very little memory. SAX-like transformations are complicated or impossible when the transformation requires major structural changes, but they work well in three cases (the third is sketched in code after the list):
- Adding information to a document
- Removing information from a document
- Modifying information in place, such as renaming elements and attributes or changing text
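As a sketch of the third case, the following Java SAX filter renames every para element to p as the events stream past (the element names are invented for illustration); the XMLFilterImpl base class passes everything else through untouched.

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

public class RenameFilter extends XMLFilterImpl {

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        if ("para".equals(local))
            super.startElement(uri, "p", "p", atts); // rename in place
        else
            super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        if ("para".equals(local))
            super.endElement(uri, "p", "p");
        else
            super.endElement(uri, local, qName);
    }
}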
Many typical transformations fall into these categories. For example, one common transformation for technical and sales publications is adding live data from a database at publication time. The original XML markup might look like that in Example 8-4.
Example 8-4. Catalog Entry before Transformation
<item>
  <source>Henrickson</source>
  <name>Acoustic Guitar</name>
  <description>This guitar includes a mahogany fret board,
  gold-plated tuning pegs, and a solid top.</description>
  <price dbref="g3905778/price"/>
</item>
The transformation engine executes a database query based on the value of the dbref attribute in the price element and replaces the element with the latest price information, as shown in Example 8-5.
Example 8-5. Catalog Entry after Transformation
<item>
  <source>Henrickson</source>
  <name>Acoustic Guitar</name>
  <description>This guitar includes a mahogany fret board,
  gold-plated tuning pegs, and a solid top.</description>
  <price>
    <status>sale</status>
    <expiry-date>2005-01-01</expiry-date>
    <currency>USD</currency>
    <amount>1999.00</amount>
  </price>
</item>
A tree-based transformation engine, such as XSLT, brings little advantage to this kind of work. In fact, because XSLT does not specify a standard method for database access, any XSLT stylesheet performing this transformation would not be portable anyway, even if the engine did support database access. On the other hand, using Perl, Python, or Java, a transformation like this is trivial for an experienced developer.
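For instance, here is a sketch of the transformation as a Java SAX filter. The database call is a placeholder, and for brevity only the amount child of the new price element is generated:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

public class PriceFilter extends XMLFilterImpl {

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        String dbref = "price".equals(local) ? atts.getValue("dbref") : null;
        if (dbref == null) {
            super.startElement(uri, local, qName, atts);
            return;
        }
        // Replace <price dbref="..."/> with a populated price element;
        // the incoming endElement event will close it for us.
        super.startElement(uri, "price", "price", new AttributesImpl());
        emitLeaf(uri, "amount", lookupPrice(dbref));
    }

    private void emitLeaf(String uri, String name, String value)
            throws SAXException {
        super.startElement(uri, name, name, new AttributesImpl());
        super.characters(value.toCharArray(), 0, value.length());
        super.endElement(uri, name, name);
    }

    private String lookupPrice(String dbref) {
        return "1999.00"; // placeholder for a real database query
    }
}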
8.3.4 Pipelines
Another performance problem that can creep into an XML processing system is duplicated parsing. An XML document will often pass through several components in a chain from the point it enters a system to the point it leaves; if the XML is written to disk and then reparsed into memory each time a component touches it, the system can slow down considerably. [5] The solution to this problem is to use a processing pipeline like the one in Figure 8-1.
Figure 8-1 XML processing in a pipeline
The XML document enters the pipeline through the parser on the left, which converts the document to a series of SAX or SAX-like events. The parser then passes the events on to the first component, which manipulates them as necessary (say, by adding or removing information) and then passes the modified events on to another component. The second component knows nothing about the first component: As far as it is concerned, the events could be coming directly from a parser. The second component makes further changes to the event stream and then passes the modified events on to the third component, and so on. At the end of the chain is a component that simply writes the events back out to disk or sends them out over the network in XML format. The alternative to this pipeline would be to parse and rewrite the XML document for each processing component, as in Figure 8-2: Note all the extra work required for processing.
Figure 8-2 XML processing without a pipeline
The pipeline concept works with both DOM-like interfaces and SAX-like interfaces. For DOM-like interfaces, the components all work on the same in-memory tree in sequence, making changes and then passing the tree on to the next component. The beginning of the pipeline is still a parser (and tree builder, in this case), and the last component is still a writer.
Aside from the fact that, for efficiency, the components need to run on the same system, this approach has one problem: Without serialized XML to examine between components, it can be difficult to find and fix problems. Fortunately, however, the pipeline components are modular, so it is easy to fit in extra debugging components that write out the events as XML at various points in the processing.
The pipeline model is the most common one used by skilled XML developers working with SAX and other event-based interfaces, and it has proved itself in high-speed, high-demand environments. The model is not quite as common with the DOM and other tree-based interfaces, as people tend not to use those in high-speed environments in the first place, but it can work equally well.
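In Java, for example, SAX filters chain together simply by setting each filter's parent. The following sketch wires the two filters from Section 8.3.3 into a single pipeline, with an identity Transformer serving as the writer at the end of the chain, so the document is parsed exactly once:

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class Pipeline {
    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();

        // Chain the components: parser -> PriceFilter -> RenameFilter.
        PriceFilter stage1 = new PriceFilter();
        stage1.setParent(parser);
        RenameFilter stage2 = new RenameFilter();
        stage2.setParent(stage1);

        // The identity transformation serializes the final event
        // stream; no tree is ever built anywhere in the pipeline.
        TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(stage2, new InputSource(args[0])),
                new StreamResult(System.out));
    }
}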