Canonical XML Example: Different but Equal
In this section, we'll examine two XML instances that are physically different and determine whether they are logically equivalent based on a comparison of their canonical forms. To accomplish this, we'll use a handy utility called xmlcanon, the Canonical XML Processor from ElCel Technology, which generates the canonical form of a document. xmlcanon is freely available (see http://www.elcel.com/products/xmlcanonman.html) and is also included on this book's CD-ROM.
Listings 5-3 and 5-4 show the two original XML documents in their noncanonical form.
Listing 5-3 Input Variation 1 (catalog1.xml)
<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE Disc:DiscountCatalog [
<!ELEMENT Disc:DiscountCatalog (Disc:category) >
<!ATTLIST Disc:DiscountCatalog
    xmlns:Disc CDATA #FIXED
    "http://www.HouseOfDiscounts.com/namespaces/Discounts" >
<!ELEMENT Disc:category (Disc:item+) >
<!ATTLIST Disc:category name CDATA #REQUIRED >
<!ELEMENT Disc:item (Disc:price, Disc:extra*) >
<!ATTLIST Disc:item name CDATA #REQUIRED >
<!ELEMENT Disc:price (#PCDATA) >
<!ATTLIST Disc:price type (wholesale|retail) "wholesale">
<!ATTLIST Disc:price currency CDATA '$US' >
<!ELEMENT Disc:extra (#PCDATA) >
<!ENTITY internalEnt "Internal Entity Replacement Text" >
<!ENTITY externalEnt SYSTEM "oz.txt">
]>

<!-- Comment outside doc root may or may not be discarded. -->
<Disc:DiscountCatalog>
  <!-- Note quotes around 'Wild Animals' and 'Lion' on input. -->
  <Disc:category name='Wild Animals'>
    <Disc:item name = ' Lion'>
      <Disc:price type="wholesale">999.99</Disc:price>
    </Disc:item>

    <?somePI target1="foo" target2="bar" ?>
    <!-- Comment with entity ref &internalEnt; which won't expand. -->

    <Disc:item name="Tiger">
      <Disc:price type="wholesale">879.99</Disc:price>
      <Disc:extra>&#169;</Disc:extra>
      <Disc:extra>&internalEnt;</Disc:extra>
      <Disc:extra/> <!-- empty element -->
    </Disc:item>

    <Disc:item name="Bear"><Disc:price type="wholesale">1199.99</Disc:price>
      <Disc:extra>External entity replacement: &externalEnt;</Disc:extra>
      <Disc:extra>
        <![CDATA[ sale > "500.00" && sale < "2000.00" ? 'munchkin' : 'monkey' ]]>
      </Disc:extra></Disc:item>
  </Disc:category>
</Disc:DiscountCatalog>
Listing 5-4 Input Variation 2 (catalog2.xml)
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Disc:DiscountCatalog [
<!-- Different order of declarations in DTD -->
<!ENTITY externalEnt SYSTEM "oz.txt">
<!ELEMENT Disc:DiscountCatalog (Disc:category) >
<!ATTLIST Disc:DiscountCatalog
    xmlns:Disc CDATA #FIXED
    "http://www.HouseOfDiscounts.com/namespaces/Discounts" >
<!ELEMENT Disc:item (Disc:price, Disc:extra*) >
<!ATTLIST Disc:item name CDATA #REQUIRED >
<!ELEMENT Disc:price (#PCDATA) >
<!ATTLIST Disc:price type (wholesale|retail) "wholesale">
<!-- Inline in this version: ATTLIST Disc:price currency CDATA '$US' -->
<!ELEMENT Disc:extra (#PCDATA) >
<!ELEMENT Disc:category (Disc:item+) >
<!ATTLIST Disc:category name CDATA #REQUIRED >
]>

<!-- Comment outside doc root may or may not be discarded. -->
<Disc:DiscountCatalog>
  <!-- Note quotes around 'Wild Animals' and 'Lion' on input. -->
  <Disc:category name="Wild Animals">
    <Disc:item name=" Lion">
      <Disc:price type="wholesale" currency="$US" >999.99</Disc:price>
    </Disc:item>

    <?somePI target1="foo" target2="bar" ?>
    <!-- Comment with entity ref &internalEnt; which won't expand. -->

    <Disc:item name="Tiger">
      <Disc:price currency = "$US" type="wholesale" >879.99</Disc:price>
      <Disc:extra>&#169;</Disc:extra>
      <Disc:extra>Internal Entity Replacement Text</Disc:extra>
      <Disc:extra></Disc:extra> <!-- empty element -->
    </Disc:item>

    <Disc:item name="Bear"><Disc:price currency="$US" type="wholesale">1199.99</Disc:price>
      <Disc:extra>External entity replacement: &externalEnt;</Disc:extra>
      <Disc:extra>
        <![CDATA[ sale > "500.00" && sale < "2000.00" ? 'munchkin' : 'monkey' ]]>
      </Disc:extra></Disc:item>
  </Disc:category>
</Disc:DiscountCatalog>
Some of the differences are fairly obvious, but others are subtle, so I've used the UNIX diff command to compare the two input documents. The comparison appears in Listing 5-5, in which < marks lines from variation 1 and > marks lines from variation 2. The major differences between the two XML documents include:
The DTDs (internal subset) differ in order of declarations.
The DTD portion of variation 1 contains a default attribute declaration for currency, but variation 2 explicitly provides the values in the XML instance.
Variation 1 declares the internal entity internalEnt as the string "Internal Entity Replacement Text", but variation 2 explicitly provides this string as literal character data.
The attribute values Wild Animals and Lion appear in single quotes in variation 1 but in double quotes in variation 2.
In variation 1, the empty element is shown as <Disc:extra/>, whereas it is expanded in variation 2.
There are many differences concerning the use of white space between the two versions.
Variation 1 is 1,735 bytes, whereas variation 2 is 1,834 bytes, as determined by the UNIX wc (word count) command.
Listing 5-5 UNIX-Style Differences between Variations 1 and 2 (diff-input.txt)
2d1
<
3a3,4
> <!-- Different order of declarations in DTD -->
> <!ENTITY externalEnt SYSTEM "oz.txt">
8,9d8
< <!ELEMENT Disc:category (Disc:item+) >
< <!ATTLIST Disc:category name CDATA #REQUIRED >
14c13
< <!ATTLIST Disc:price currency CDATA '$US' >
---
> <!-- Inline in this version: ATTLIST Disc:price currency CDATA '$US' -->
16,17c15,16
< <!ENTITY internalEnt "Internal Entity Replacement Text" >
< <!ENTITY externalEnt SYSTEM "oz.txt">
---
> <!ELEMENT Disc:category (Disc:item+) >
> <!ATTLIST Disc:category name CDATA #REQUIRED >
23,25c22,24
< <Disc:category name='Wild Animals'>
< <Disc:item name = ' Lion'>
< <Disc:price type="wholesale">999.99</Disc:price>
---
> <Disc:category name="Wild Animals">
> <Disc:item name=" Lion">
> <Disc:price type="wholesale" currency="$US" >999.99</Disc:price>
28c27
< <?somePI target1="foo" target2="bar" ?>
---
> <?somePI target1="foo" target2="bar" ?>
31,32c30,31
< <Disc:item name="Tiger">
< <Disc:price type="wholesale">879.99</Disc:price>
---
> <Disc:item name="Tiger">
> <Disc:price currency = "$US" type="wholesale" >879.99</Disc:price>
34,35c33,34
< <Disc:extra>&internalEnt;</Disc:extra>
< <Disc:extra/> <!-- empty element -->
---
> <Disc:extra>Internal Entity Replacement Text</Disc:extra>
> <Disc:extra></Disc:extra> <!-- empty element -->
38c37
< <Disc:item name="Bear"><Disc:price type="wholesale">1199.99</Disc:price>
---
> <Disc:item name="Bear"><Disc:price currency="$US" type="wholesale">1199.99</Disc:price>
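For readers without a UNIX diff at hand, the same kind of line-by-line physical comparison can be sketched with Python's standard difflib module. This is only an illustration: the two fragments and the function name physical_diff are made up for the example and are not the catalog files, and the output uses the unified diff format rather than the classic format shown in Listing 5-5.

```python
import difflib

# Hypothetical fragments that differ only in attribute quoting and in how
# the empty element is written (not the catalog files from the listings).
variation1 = """<item name='Tiger'>
  <extra/>
</item>
"""
variation2 = """<item name="Tiger">
  <extra></extra>
</item>
"""


def physical_diff(a, b):
    """Return the unified-diff lines describing how text a differs from b."""
    return list(
        difflib.unified_diff(
            a.splitlines(),
            b.splitlines(),
            fromfile="variation1",
            tofile="variation2",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Lines starting with '-' come from variation 1, '+' from variation 2.
    for line in physical_diff(variation1, variation2):
        print(line)
```

A purely textual comparison like this reports both changed lines as differences, even though an XML processor would treat the two fragments as carrying the same information.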
So, now that we have a good idea of how the two files differ, our next step is to generate the canonical form of each variation and compare the results. Although xmlcanon supports a number of useful options, for our purposes, we can use the default behavior.
C:\>xmlcanon catalog1.xml > canon1.xml
C:\>xmlcanon catalog2.xml > canon2.xml
Listing 5-6 shows the canonical form of the first file (canon1.xml). There are several noteworthy aspects of the canonical form when compared to the input file catalog1.xml (Listing 5-3).
The entire DTD (an internal subset) has been stripped, along with the XML declaration.
All single quotes surrounding attribute values have been replaced by double quotes, although single quotes appearing in comments have remained intact.
White space in start tags has been reduced (normalized), except where it occurred within attribute values, such as " Lion".
The default value for the currency attribute of the Disc:price element has been substituted.
White space after the target of the PI has been removed, but white space has been preserved in the content of the PI.
The entity reference &internalEnt; was not expanded within a comment, but was expanded in the Disc:extra element.
The numeric character reference for the copyright symbol (&#169;) has been replaced with the literal © character.
The external entity reference &externalEnt; has been replaced by the content of the file oz.txt, which is just the string Oh, my! plus a newline.
The CDATA section has been replaced by characters and entity references.
The canonical form is only 1,107 bytes, due largely to the removal of the DTD and partially due to stripping unnecessary white space.
Listing 5-6 Canonical Form of Input Variation 1 (canon1.xml)
<!-- Comment outside doc root may or may not be discarded. -->
<Disc:DiscountCatalog xmlns:Disc="http://www.HouseOfDiscounts.com/namespaces/Discounts">
  <!-- Note quotes around 'Wild Animals' and 'Lion' on input. -->
  <Disc:category name="Wild Animals">
    <Disc:item name=" Lion">
      <Disc:price currency="$US" type="wholesale">999.99</Disc:price>
    </Disc:item>

    <?somePI target1="foo" target2="bar" ?>
    <!-- Comment with entity ref &internalEnt; which won't expand. -->

    <Disc:item name="Tiger">
      <Disc:price currency="$US" type="wholesale">879.99</Disc:price>
      <Disc:extra>©</Disc:extra>
      <Disc:extra>Internal Entity Replacement Text</Disc:extra>
      <Disc:extra></Disc:extra> <!-- empty element -->
    </Disc:item>

    <Disc:item name="Bear"><Disc:price currency="$US" type="wholesale">1199.99</Disc:price>
      <Disc:extra>External entity replacement: Oh, my!
</Disc:extra>
      <Disc:extra>
         sale &gt; "500.00" &amp;&amp; sale &lt; "2000.00" ? 'munchkin' : 'monkey' 
      </Disc:extra></Disc:item>
  </Disc:category>
</Disc:DiscountCatalog>
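Several of the effects noted for Listing 5-6 can be reproduced on a small scale with the canonicalize function in Python's standard xml.etree.ElementTree module (Python 3.8 or later). Treat this as a rough sketch under some assumptions: canonicalize implements the later Canonical XML 2.0 rules rather than the version xmlcanon applies, it drops comments by default, and the SOURCE fragment and the to_canonical helper are hypothetical names introduced only for this example.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical fragment (not one of the catalog listings) that packs several
# input quirks into one element: an XML declaration, single-quoted attributes
# with extra white space, an empty-element tag, a CDATA section, and a
# numeric character reference.
SOURCE = (
    '<?xml version="1.0"?>\n'
    "<price type = 'wholesale'   currency='$US'>"
    "<extra/>"
    "<note><![CDATA[a < b && c]]></note>"
    "<sym>&#169;</sym>"
    "</price>"
)


def to_canonical(xml_text):
    """Return the canonical (C14N 2.0) serialization of xml_text as a string."""
    return canonicalize(xml_text)


if __name__ == "__main__":
    print(to_canonical(SOURCE))
```

In the printed result, the XML declaration should be gone, the attributes should appear double-quoted and in sorted order, the empty element should be written as a start-tag and end-tag pair, and the CDATA section should be replaced by escaped character data.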
From a Cygwin UNIX shell window, I used the diff command to verify that the two canonical forms were in fact identical:
$ diff canon1.xml canon2.xml
The diff command produced no output, which indicates that the two canonical files, canon1.xml and canon2.xml, are identical.
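The idea behind this check, canonicalize both documents and then compare the results, can also be expressed directly in code. The following is a minimal sketch that assumes Python's xml.etree.ElementTree.canonicalize (which follows Canonical XML 2.0, not the xmlcanon implementation); the function name logically_equal and the three sample fragments are hypothetical and much smaller than the catalog files.

```python
from xml.etree.ElementTree import canonicalize

NS = "http://www.HouseOfDiscounts.com/namespaces/Discounts"

# Hypothetical fragments: A and B differ physically (quotes, attribute order,
# empty-element syntax); C differs logically (a different price).
DOC_A = (
    f"<Disc:item xmlns:Disc='{NS}' name='Tiger'>"
    '<Disc:price type="wholesale" currency="$US">879.99</Disc:price>'
    "<Disc:extra/></Disc:item>"
)
DOC_B = (
    f'<Disc:item xmlns:Disc="{NS}" name="Tiger">'
    '<Disc:price currency = "$US" type="wholesale" >879.99</Disc:price>'
    "<Disc:extra></Disc:extra></Disc:item>"
)
DOC_C = DOC_B.replace("879.99", "899.99")


def logically_equal(xml_a, xml_b):
    """Return True when both documents have the same canonical form."""
    return canonicalize(xml_a) == canonicalize(xml_b)


if __name__ == "__main__":
    print(DOC_A == DOC_B)                  # raw text comparison
    print(logically_equal(DOC_A, DOC_B))   # comparison of canonical forms
    print(logically_equal(DOC_A, DOC_C))   # genuinely different content
```

Comparing canonical strings for equality plays the same role here as the empty diff output: any remaining difference would point to a logical difference rather than a physical one.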
ElCel's xmlcanon gives you the option of trying the original XML canonicalization method proposed by XML expert James Clark (http://www.jclark.com/xml/canonxml.html) in one of the working drafts of Canonical XML. While the canonical form that Clark's method produces is very different from that produced by the final W3C Recommendation, I verified that it too produced equivalent canonical forms for the two sample input files, using the command lines:
C:\>xmlcanon --method="JClark" catalog1.xml > clark1.xml
C:\>xmlcanon --method="JClark" catalog2.xml > clark2.xml
$ diff clark1.xml clark2.xml
Listing 5-7 shows clark1.xml, the canonical form first proposed by James Clark. The output, which contains no line breaks, has been wrapped here to make it at least semireadable. It contains only 990 bytes.
Listing 5-7 James Clark's Canonical Form (clark1.xml)
<Disc:DiscountCatalog xmlns:Disc="http://www.HouseOfDiscounts.com/namespaces/Discounts"> <Disc:category name="Wild Animals"> <Disc:item name=" Lion"> <Disc:price currency="$US" type="wholesale">999.99</Disc:price> </Disc:item> <?somePI target1="foo" target2="bar" ?> <Disc:item name="Tiger"> <Disc:price currency="$US" type="wholesale">879.99</Disc:price> <Disc:extra>©</Disc:extra> <Disc:extra>Internal Entity Replacement Text</Disc:extra> <Disc:extra></Disc:extra> </Disc:item> <Disc:item name="Bear"><Disc:price currency="$US" type="wholesale">1199.99</Disc:price> <Disc:extra>External entity replacement: Oh, my! </Disc:extra> <Disc:extra> sale > "500.00" && sale < "2000.00" ? 'munchkin' : 'monkey' </Disc:extra></Disc:item> </Disc:category> </Disc:DiscountCatalog>
When we again use the UNIX diff command to compare clark1.xml and clark2.xml, we find no difference. Therefore, although the canonicalization method that produced these files is not the same as the one the W3C ultimately adopted, Clark's method also shows that our original two input files, catalog1.xml and catalog2.xml, have the same canonical form (i.e., they are logically equivalent despite many physical differences).