

Understanding XML Infosets

The inspiration behind both XML infosets (formally named XML information sets) and canonical XML is to make handling the data in XML documents easier. Reducing an XML document down to its infoset is intended to make comparisons between all kinds of XML documents easier by presenting the data in those documents in a standard way. You can find the official XML Information Set specification at http://www.w3.org/TR/xml-infoset.

To understand what infosets are and what they're used for, imagine searching for data on the World Wide Web. You might want to search for a particular topic, such as XML, and you would turn up millions of matches. How could you possibly write software to compare those documents? The data in those documents isn't stored in any way that's directly comparable.

That's where infosets come in. The idea is to regularize how the data in an XML document is represented so that, ultimately, software can work with thousands of such documents. An infoset sets up an abstract way of looking at an XML document that allows it to be compared to others. (Note that documents need to be well-formed to have an infoset.)

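An XML infoset can contain fifteen different types of information items in the draft of the specification that this list follows (the final W3C Recommendation trims the set to eleven by dropping the four entity and CDATA start and end markers):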

  • A document information item

  • Element information items

  • Attribute information items

  • Processing instruction information items

  • Reference to skipped entity information items

  • Character information items

  • Comment information items

  • A document type declaration information item

  • Entity information items

  • Notation information items

  • Entity start marker information items

  • Entity end marker information items

  • CDATA start marker information items

  • CDATA end marker information items

  • Namespace declaration information items
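
To make these item types a little more concrete, here's a minimal sketch in Python that walks a DOM tree built by the standard-library xml.dom.minidom module and tallies the kinds of items it finds. The mapping from DOM node types to information item names is an assumption made for illustration (the DOM is a different model, not an infoset implementation), and the sample document and the count_items name are hypothetical.

```python
from collections import Counter
from xml.dom import Node, minidom

# Assumed, illustrative mapping from DOM node types to the item names in the
# list above; the DOM is a different model, so the correspondence is rough.
ITEM_NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.COMMENT_NODE: "comment",
    Node.DOCUMENT_TYPE_NODE: "document type declaration",
}


def count_items(xml_text):
    """Tally rough infoset-style item counts for a well-formed document."""
    counts = Counter()

    def visit(node):
        if node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            # The infoset models text one character at a time.
            counts["character"] += len(node.data)
        else:
            counts[ITEM_NAMES.get(node.nodeType, "other")] += 1
        if node.nodeType == Node.ELEMENT_NODE:
            # Attributes hang off elements rather than appearing as children.
            counts["attribute"] += node.attributes.length
        for child in node.childNodes:
            visit(child)

    visit(minidom.parseString(xml_text))
    return counts


# Hypothetical sample document.
SAMPLE = (
    '<?xml version="1.0"?><!-- note --><?target data?>'
    '<doc id="1"><item>hi</item><item/></doc>'
)

for name, total in sorted(count_items(SAMPLE).items()):
    print(name, total)
```

Note that the XML declaration doesn't show up in the tally at all: the parser consumes it rather than exposing it as a node.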

So what software works with infosets? None, really. Infosets are primarily theoretical constructs, and the infoset specification is mostly designed to provide a set of definitions that other XML specifications can use when they need to refer to the information in an XML document. Although the term infoset has entered common usage as a way to refer to the information in an XML document, the specification isn't detailed enough to allow any real implementation. The closest you can come these days to truly regularizing the data in XML documents to make it easy to compare them is to use canonical XML, coming up next.
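
As a rough preview of that idea, here's a minimal sketch that uses the canonicalize function from Python's standard-library xml.etree.ElementTree module (Python 3.8 or later), which writes documents in Canonical XML 2.0 form. The three sample documents and the same_canonical_form helper are made up for illustration; the point is only that documents that differ in surface syntax can compare equal once they're regularized.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical documents: doc_a and doc_b carry the same data but differ in
# attribute order, quoting style, spacing, and empty-element syntax.
doc_a = "<book  id=\"1\" lang='en'><title>XML</title><note/></book>"
doc_b = '<book lang="en" id="1"><title>XML</title><note></note></book>'
# doc_c really does contain different data (a different id value).
doc_c = '<book id="2" lang="en"><title>XML</title><note/></book>'


def same_canonical_form(first, second):
    """Return True if two XML strings have identical canonical forms."""
    return canonicalize(first) == canonicalize(second)


print(canonicalize(doc_a))
print(same_canonical_form(doc_a, doc_b))
print(same_canonical_form(doc_a, doc_c))
```

Comparing the raw strings directly would report doc_a and doc_b as different, even though they hold the same information.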
