XML Performance and Size
- Advantages of XML for Size and Performance
- Disadvantages of XML for Size and Performance
- Processing Performance
- Size
- Final Words on Performance and Size
XML has a reputation for being big and slow.
- An XML document is often larger, sometimes much larger, than equivalent files in other formats, requiring more memory, disk storage, and network bandwidth.
- XML files need to go through complex parsing and transformations, which can take up considerable processing power and time.
An XML document can be larger in two ways: (1) in its proper XML form, requiring more storage space and bandwidth; and (2) in its compiled, in-memory form, requiring more computing resources.
Sometimes, XML deserves this reputation. Building a Document Object Model [DOM] tree, for example, or performing an XSLT [XSLT] transformation can use up a surprising amount of time and memory. Often, developers miss these problems while building a proof of concept but then run into them hard while building a production system with a full traffic load.
XML's very nature causes some of these problems: a plaintext format with frequent and repeated labels is bound to be a bit big and a bit slow. But XML's designers decided that it was worth trading some size and efficiency for the advantages of a portable, transparent information format. This has been a winning tradeoff in the past: Most of the popular Internet formats, such as SMTP, FTP, HTTP, POP3, and TELNET, also use plaintext, and their transparency and simplicity caused them to win out over more optimized but less transparent competitors.
On the other hand, many of the worst performance problems people run into with XML are a result of the tools and libraries they choose and the way they use them. Toolkits often hide large size and performance costs behind a simple interface: One or two function calls can trigger an exponential time and space explosion behind the scenes. When this happens, developers do not have to give up on XML, but they sometimes have to abandon their toolkits and do more work by hand. This chapter introduces some of the tips and tricks to work around XML's imperfections in a high-performance environment.
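To make the toolkit point concrete, the following is a minimal Python sketch (an added illustration, not from this chapter; the file name orders.xml and the element name order are invented) contrasting a one-call, whole-document load with a streaming pass that does a little more work by hand but keeps memory use flat:

import xml.etree.ElementTree as ET

def count_orders_dom(path):
    # Convenient: one call builds the entire tree in memory at once.
    tree = ET.parse(path)
    return len(tree.getroot().findall(".//order"))

def count_orders_streaming(path):
    # More code, but memory stays roughly constant no matter how large the file is.
    count = 0
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "order":
            count += 1
        elem.clear()  # discard the subtree once it has been counted
    return count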
8.1 Advantages of XML for Size and Performance
Because XML markup is text, some developers assume that it will always be more verbose and slower to process than a binary format. In fact, an optimized binary format can be efficient when the data structure is highly consistent, but when the structure varies (for example, with optional or repeatable fields or hierarchical relationships), a binary format can end up as large as an XML file, sometimes even larger.
Likewise, people assume that parsing XML will always be slower than parsing a binary format, but in practice, tool support sometimes cancels out that difference: Because XML is so widely used, many free-software developers and commercial vendors have put a lot of time into profiling and optimizing the programs that do low-level XML parsing.
8.1.1 Space Efficiency
When data is in a highly consistent, predictable format, such as a table (see Section 4.3.1), it is possible to build efficient binary formats for storing it. For example, a binary file containing a table of 1,000 rows of 8 short integers will require only 16 octets for each row, or 16,000 octets in total, or even less with some compression schemes. Compare that with the same row in typical XML markup, shown in Example 8-1.
Example 8-1. A Row of Integers in XML
<row>
 <num>18</num>
 <num>11</num>
 <num>64</num>
 <num>23</num>
 <num>5</num>
 <num>65</num>
 <num>2</num>
 <num>10</num>
</row>
Even using fairly succinct XML names, such as row and num, the XML example requires 130 octets for each row, or eight times as much storage space as the binary format.
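The arithmetic can be checked with a small Python sketch (added here as an illustration, not part of the original comparison), which packs one row as 2-octet integers and rebuilds the marked-up row from Example 8-1:

import struct

row = [18, 11, 64, 23, 5, 65, 2, 10]

binary_row = struct.pack("<8h", *row)  # eight little-endian 2-octet integers
xml_row = ("<row>\n"
           + "".join(" <num>%d</num>\n" % n for n in row)
           + "</row>")

print(len(binary_row))  # 16 octets
print(len(xml_row))     # 130 octets, the figure cited above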
The binary format is space efficient because it can make assumptions about the data: There is no need to label rows or items, because a new item will start every 2 octets, and a new row will start every 16 octets. But what happens if the data is not quite so regular? The binary format will have to start adding extra information.
- If the rows have variable lengths (that is, items can be omitted from the end), the binary file will have to store the length of each row, adding an overhead of 1 or 2 octets for each row.
- If individual items have variable lengths, such as textual data, the binary file will have to store the length of each item, adding an overhead of 1 or 2 octets for each item.
- If items can be omitted or repeated in the middle of rows, the binary file will have to label each item so that it will be clear what has been repeated or omitted, adding an overhead of typically 4 octets for each item, assuming pointers into a name table.
- If the data is not simple rows of items but can have a more complex structure of nodes, the binary file may have to maintain navigational pointers, such as parent, first child, and next sibling, adding typically 4 octets per pointer for each node.
- If leaf and branch nodes can be mixed at the same level, as in XML or HTML mixed content, nodes will require type information, adding typically 1 octet for each node.
Accordingly, Table 8-1 shows that the memory requirement for each item, or node, is now 17 octets, excluding the overhead of keeping one unique copy of each element name in a lookup table. By comparison, an item as an XML element can take as few as 4 octets (<a/>), depending on the name length and encoding.
Table 8-1. A Binary Node
Property | Length (octets)
---|---
Type | 1
Name pointer | 4
Parent pointer | 4
Next-sibling pointer | 4
First-child pointer | 4
Total | 17
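As a rough illustration of Table 8-1, the per-node record amounts to one 1-octet type code plus four 4-octet pointers; the following Python sketch assumes standard sizes with no padding:

import struct

# type, name pointer, parent pointer, next-sibling pointer, first-child pointer
NODE_FORMAT = "=BIIII"
print(struct.calcsize(NODE_FORMAT))  # 17 octets per node, as in Table 8-1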
Even when both the start and end tags appear and the name is longer, such as table, the overhead for the start and end tags will be only twice the name length plus 5 octets in UTF-8 encoding, or, in this case, 15 octets, still shorter than the overhead in the binary format. Furthermore, the binary format does not yet have any mechanism for representing the equivalent of XML attributes, which would add yet more pointers and other overhead to it.
In the end, then, it is not the fact that XML is a text-based format that makes it verbose; rather, it is the fact that XML can encode very complex structure. When that extra structure is not necessary, an XML document can also be concise:
<r>1 2 3 4 5 6 7 8</r>
The 22 octets required for that data row in XML compare favorably to the 16 octets required for the data row in binary format. In fact, if the individual numbers were 4-octet integers, the efficient binary encoding would require 32 octets, whereas this particular XML row would still require only 22 octets. It is possible to tune a binary format to use a little less space, say, by stemming, but it is important to recognize that XML itself, text-based as it is, does provide a relatively efficient way to represent complex structures.
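Those figures are easy to verify with another short Python sketch (again an added illustration):

import struct

compact_xml = "<r>1 2 3 4 5 6 7 8</r>"
print(len(compact_xml.encode("utf-8")))  # 22 octets of XML
print(struct.calcsize("<8h"))            # 16 octets as 2-octet integers
print(struct.calcsize("<8i"))            # 32 octets as 4-octet integers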
8.1.2 Software and Hardware Support
The text-based protocols used on the Internet (SMTP for sending e-mail, HTTP for retrieving Web resources, FTP for downloading files, and TELNET for connecting to remote machines) are, like XML, plaintext formats, and they have suffered from the same complaints that XML faces. Because they are text, they require an initial parsing step that slows down processing.
However, as the Web grew in popularity, individual developers and networking companies started to make software and hardware especially designed to work with these higher-level protocols. For example, many, if not most, routers can read and understand HTTP as well as the lower-level TCP and IP protocols and can make more intelligent routing choices as a result. Hardware acceleration is available for creating and managing HTTP, and networking libraries are efficient and well debugged. It does not matter so much that HTTP adds a little parsing overhead, because modern software and hardware support more than cancels out that disadvantage.
The same process is starting to take place with XML. Although higher-level tools, such as XSLT engines, remain slow, the low-level tools, especially XML parsers, have become fast and robust. Someone parsing an XML file is taking advantage of thousands of hours of experimentation, debugging, profiling, and optimizing that highly competitive XML parser providers, both vendors and free-software developers, have put into their products. Compared to custom-designed code to read a binary format, an XML parser is less likely to crash or fall into performance traps, such as unintentional buffer copying, and more likely to run fast and efficiently.
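In practice, this means handing the raw bytes to one of those heavily optimized parsers rather than writing custom reading code. A minimal Python sketch using the standard expat binding (the element-counting task is only an example):

import xml.parsers.expat

def count_elements(path):
    count = 0
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        nonlocal count
        count += 1

    parser.StartElementHandler = start
    with open(path, "rb") as f:
        parser.ParseFile(f)  # the profiled, optimized C parser does the work
    return count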
Furthermore, like HTTP, XML is starting to get hardware support. Several vendors, such as DataPower, Sarvega, and Reactivity, are releasing products for low-level XML parsing and, sometimes, for such higher-level operations as XSLT. This hardware still needs to be proven in the field and the market, but it suggests that the processing inefficiencies particular to XML will matter even less in the future than they do now.