Understanding Canonical XML
Infosets are only abstract formulations of the information in an XML document. So, without reducing an XML document to its infoset, how can you compare XML documents character by character? You can write your documents in canonical XML.
TIP
You can find a canonical XML tutorial at http://www.xfront.com/canonical/CanonicalXML.html.
Canonical XML is a companion specification to XML, and you can read all about it at http://www.w3.org/TR/xml-c14n. Canonical XML is a very strict XML syntax, which lets documents in canonical XML be compared directly.
Using this strict syntax makes it easier to see whether two XML documents are the same. For example, a section of text in one document might read Black &amp; White, whereas the same section of text might read Black &#x26; White in another document, and even <![CDATA[Black & White]]> in a third. If you compare those three documents byte by byte, they'll be different. But if you write them all in canonical XML, which specifies every aspect of the syntax you can use, these three documents would all have the same version of this text (which would be Black &amp; White) and could be compared without problem.
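The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree.canonicalize function (Python 3.8 or later), which implements the newer Canonical XML 2.0 rather than the version at the URL given earlier, and the title element is a hypothetical wrapper chosen for the example.

```python
from xml.etree.ElementTree import canonicalize

# Three spellings of the same text: a predefined entity reference,
# a numeric character reference, and a CDATA section.
# The <title> element is a hypothetical wrapper for this example.
variants = [
    "<title>Black &amp; White</title>",
    "<title>Black &#x26; White</title>",
    "<title><![CDATA[Black & White]]></title>",
]

# canonicalize() returns the canonical form as a string
# when no output file is given.
canonical_forms = [canonicalize(v) for v in variants]

# The raw strings all differ, but the canonical forms collapse to one value.
print(len(set(variants)))         # number of distinct raw spellings
print(len(set(canonical_forms)))  # number of distinct canonical forms
print(canonical_forms[0])
```

Because the canonical forms are ordinary strings, a plain equality test is enough to decide whether two documents carry the same content.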
As you might imagine, the canonical XML syntax is very strict; for example, canonical XML uses UTF-8 character encoding only, carriage-return linefeed pairs are replaced with linefeeds (that is, the single character #xA), CDATA sections are replaced by their character content, all entity references must be expanded, and much more, as specified in http://www.w3.org/TR/xml-c14n.
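Two of these rules, line-break normalization and UTF-8 output, can be seen in a short Python sketch. As an assumption, it again relies on the standard library's xml.etree.ElementTree.canonicalize (Python 3.8 or later, Canonical XML 2.0); the note element and its text are hypothetical sample data.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical inputs: the first uses a CR+LF line break and a character
# reference; the second uses a bare LF and the literal character.
windows_style = "<note>caf&#233;\r\nau lait</note>"
unix_style = "<note>caf\u00e9\nau lait</note>"

canon_a = canonicalize(windows_style)
canon_b = canonicalize(unix_style)

# Serialize as UTF-8, the only encoding canonical XML allows,
# before comparing the documents byte by byte.
bytes_a = canon_a.encode("utf-8")
bytes_b = canon_b.encode("utf-8")

print(bytes_a == bytes_b)
```

The two inputs differ as raw text, yet after canonicalization the carriage return is gone and the character reference has been replaced by the character itself, so the byte sequences match.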
TIP
In their canonical form, documents can be compared directly, and any differences will be readily apparent. Because canonical XML is intended to be byte-by-byte correct, it's often a good idea to use software to convert your XML documents to that form. One such package that will convert valid XML documents to canonical form comes with the XML for Java software that you can get free from IBM's AlphaWorks (http://www.alphaworks.ibm.com/tech/xml4j). The actual program is named DOMWriter, and it's part of the XML for Java package.
That completes today's discussion on constructing XML documents. We've covered everything we need to know before we start discussing how to create valid XML documents, and we're going to start doing that tomorrow.