Understanding XML Infosets
The motivation behind both XML infosets (formally, XML information sets) and canonical XML is to make the data in XML documents easier to handle. Reducing an XML document to its infoset is intended to make comparisons between XML documents easier by presenting the data in those documents in a standard way. You can find the official XML Information Set specification at http://www.w3.org/TR/xml-infoset.
To understand what infosets are and what they're used for, imagine searching for data on the World Wide Web. You might want to search for a particular topic, such as XML, and you would turn up millions of matches. How could you possibly write software to compare those documents? The data in those documents isn't stored in any way that's directly comparable.
That's where infosets come in. The idea is to regularize how the data in an XML document is described, with the ultimate goal of letting you work with thousands of such documents. An infoset sets up an abstract way of looking at an XML document that allows it to be compared to others. (Note that documents need to be well-formed to have an infoset.)
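As a rough illustration of the well-formedness requirement, the following Python sketch asks a parser whether it accepts a document. The function name is_well_formed is hypothetical, and the check reflects only what the standard library parser reports. It is not a full test of whether a document has an infoset.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error.

    Hypothetical helper for illustration; it relies entirely on the
    errors raised by the standard library's expat-based parser.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

Under this sketch, a document such as <a><b/></a> passes, while <a><b></a> fails because the b element is never closed.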
An XML infoset, as described here, can contain fifteen different types of information items (the final W3C Recommendation consolidates these into eleven):
A document information item
Element information items
Attribute information items
Processing instruction information items
Reference to skipped entity information items
Character information items
Comment information items
A document type declaration information item
Entity information items
Notation information items
Entity start marker information items
Entity end marker information items
CDATA start marker information items
CDATA end marker information items
Namespace declaration information items
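To make a few of these item types concrete, here is a minimal Python sketch that parses a small document with the standard library's xml.dom.minidom and tallies the DOM nodes that roughly correspond to information items. The DOM is not the infoset, so the mapping is only an approximation. The sample document, the count_items function, and the item labels are assumptions made for illustration.

```python
from collections import Counter
from xml.dom import minidom, Node

# Hypothetical sample document, used only for illustration.
SAMPLE = """<?xml version="1.0"?>
<!-- A comment -->
<?app-hint sort="yes"?>
<doc xmlns:ex="http://example.com/ns" id="d1">
  <ex:item>Hi</ex:item>
</doc>"""


def count_items(xml_text):
    """Tally DOM nodes under the information item they roughly match."""
    counts = Counter()

    def visit(node):
        if node.nodeType == Node.DOCUMENT_NODE:
            counts["document"] += 1
        elif node.nodeType == Node.ELEMENT_NODE:
            counts["element"] += 1
            attrs = node.attributes
            for i in range(attrs.length):
                name = attrs.item(i).name
                # The DOM stores xmlns declarations as ordinary attributes;
                # separate them out here.
                if name == "xmlns" or name.startswith("xmlns:"):
                    counts["namespace declaration"] += 1
                else:
                    counts["attribute"] += 1
        elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            # Each character of text counts as one character item.
            counts["character"] += len(node.data)
        elif node.nodeType == Node.COMMENT_NODE:
            counts["comment"] += 1
        elif node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
            counts["processing instruction"] += 1
        for child in node.childNodes:
            visit(child)

    visit(minidom.parseString(xml_text))
    return counts
```

Calling count_items(SAMPLE) returns a Counter keyed by item label. Note that the XML declaration on the first line is not reported as a processing instruction, so only the app-hint instruction is counted.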
So what software works with infosets? None, really. Infosets are primarily theoretical constructs, and the infoset specification is mostly designed to provide a set of definitions that other XML specifications can use when they need to refer to the information in an XML document. Although the term infoset has entered common usage as a way to refer to the information in an XML document, the specification isn't specific enough to allow any real implementation. The closest you can come these days to truly regularizing the data in XML documents to make it easy to compare them is to use canonical XML, coming up next.
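As a brief preview of that kind of comparison, the sketch below uses the canonicalize function in Python's xml.etree.ElementTree module (available in Python 3.8 and later) to test whether two documents reduce to the same canonical text. The helper name same_canonical_form is hypothetical, and the function uses the library's default canonicalization options.

```python
import xml.etree.ElementTree as ET


def same_canonical_form(xml_a, xml_b):
    """Return True if both documents serialize to the same canonical XML.

    Hypothetical helper for illustration; it compares the strings
    produced by ET.canonicalize with its default options.
    """
    return ET.canonicalize(xml_a) == ET.canonicalize(xml_b)
```

For example, <a  y="2" x='1'></a> and <a x="1" y="2"/> differ in attribute order, quoting, and empty-element syntax, yet this sketch treats them as equal. Changing an attribute value makes them compare as different.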