- Document Object Model (DOM)
- PHP and the DOM
- Traversing the DOM with PHP's DOM Classes
- Traversing the DOM with PHP's XPath Classes
- Manipulating DOM Trees
- DOM or SAX?
- Summary
Traversing the DOM with PHP's DOM Classes
Because PHP's DOM parser works by creating standard objects to represent XML structures, an understanding of these objects and their capabilities is essential to using this technique effectively. This section examines the classes that form the blueprint for these objects in greater detail.
DomDocument Class
A DomDocument object is typically the first object created by the DOM parser when it completes parsing an XML document. It may be created by a call to xmldoc():
$doc = xmldoc("<?xml version='1.0'?><element>potassium</element>");
Or, if your XML data is in a file (rather than a string), you can use the xmldocfile() function to create a DomDocument object:
$doc = xmldocfile("element.xml");
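For comparison only, here is a minimal sketch of the same two steps using the DOMDocument class from the DOM extension bundled with PHP 5 and later. That class is not the API this chapter describes, and the temporary file below is merely a stand-in for element.xml.

```php
<?php
// Assumption: PHP 5 or later with the bundled DOM extension (DOMDocument),
// not the xmldoc()/xmldocfile() functions used in this chapter.

$xml = "<?xml version='1.0'?><element>potassium</element>";

// Parse XML held in a string
$fromString = new DOMDocument();
$fromString->loadXML($xml);

// Parse XML held in a file (a throwaway temp file stands in for element.xml)
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, $xml);

$fromFile = new DOMDocument();
$fromFile->load($path);
unlink($path);

echo $fromString->documentElement->tagName, "\n";    // element
echo $fromFile->documentElement->textContent, "\n";  // potassium
```

Both objects describe the same tree; only the source of the XML data differs.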
Treading the Right Path
If you're using Windows, you'll need to give xmldocfile() the full path to the XML file. Don't forget to include the drive letter!
When you examine the structure of the DomDocument object with print_r(), you can see that it contains basic information about the XML document, including the XML version, the encoding and character set, and the URL of the document:
DomDocument Object ( [name] => [url] => [version] => 1.0 [standalone] => -1 [type] => 9 [compression] => -1 [charset] => 1 )
Peekaboo!
You'll notice that many examples in this book (particularly in this chapter) use the print_r() function to display the structure of a particular PHP variable. In case you're not familiar with this function, you should know that it provides an easy way to investigate the innards of a particular variable, array, or object. Use it whenever you need to look inside an object to see what makes it tick; and, if you're feeling really adventurous, you might also want to take a look at the var_dump() and var_export() functions, which provide similar functionality.
Each of these properties provides information on some aspect of the XML document:
name: Name of the XML document
url: URL of the document
version: XML version used
standalone: Whether or not the document is a standalone document
type: Integer corresponding to one of the DOM node types (see Table 3.1)
compression: Whether or not the file was compressed
charset: Character set used by the document
The application can use this information to make decisions about how to process the XML data; for example, as Listing 3.3 demonstrates, it may reject documents based on the version of XML being used.
Listing 3.3 Using DomDocument Properties to Verify XML Version Information
<?php
// XML data
$xml_string = "<?xml version='1.0'?><element>potassium</element>";

// create a DOM object
if (!$doc = xmldoc($xml_string)) {
    die("Error in XML");
}
// version check
else if ($doc->version > 1.0) {
    die("Unsupported XML version");
}
else {
    // XML processing code here
}
?>
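The same check can be sketched with the DOMDocument class of later PHP releases, which exposes the declared version through its xmlVersion property. This is an illustration under stated assumptions rather than the chapter's API: it needs PHP 7 or later for the type declarations, and isSupportedXml() is a hypothetical helper name.

```php
<?php
// Assumption: PHP 7+ with the bundled DOM extension.
// isSupportedXml() is a hypothetical helper, not part of any library.

function isSupportedXml(string $xml): bool
{
    // Collect parser errors internally instead of emitting warnings
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadXML($xml);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return false; // not well-formed
    }

    // xmlVersion holds the version from the XML declaration, e.g. "1.0"
    return version_compare($doc->xmlVersion, '1.0', '<=');
}

var_dump(isSupportedXml("<?xml version='1.0'?><element>potassium</element>")); // bool(true)
var_dump(isSupportedXml("<element>potassium"));                                // bool(false)
```

Using version_compare() avoids comparing the version string as a floating-point number.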
In addition to the properties described previously, the DomDocument object also comes with the following methods:
root(): Returns a DomElement object representing the document element
dtd(): Returns a DTD object containing information about the document's DTD
add_root(): Creates a new document element, and returns a DomElement object representing that element
dumpmem(): Dumps the XML structure into a string variable
xpath_new_context(): Creates an XPathContext object for XPath evaluation
While parsing XML data, you'll find that the root() method is the one you use most often, whereas the add_root() and dumpmem() methods come in handy when you're creating or modifying an XML document tree in memory (discussed in detail in the "Manipulating DOM Trees" section).
X Marks the Spot
In case you're wondering, XPath, or the XML Path Language, provides an easy way to address specific parts of an XML document. The language uses directional axes, coupled with conditional tests, to create node collections matching a specific criterion, and also provides standard constructs to manipulate these collections.
PHP's XPath implementation is discussed in detail in the upcoming section titled "Traversing the DOM with PHP's XPath Classes."
In Listing 3.4, the variable $fruit contains the root node (the element named fruit).
Listing 3.4 Accessing the Document Element via the DOM
<?php
// create a DomDocument object
$doc = xmldoc("<?xml version='1.0' encoding='UTF-8' standalone='yes'?><fruit>watermelon</fruit>");

// root node
$fruit = $doc->root();
?>
To DTD or Not to DTD
The dtd() method of the DomDocument object creates a DTD object, which contains basic information about the document's Document Type Definition. Here's what it looks like:
Dtd Object ( [systemId] => weather.dtd [name] => weather )
This DTD object exposes two properties: the systemId property reveals the filename of the DTD document, whereas the name property contains the name of the document element.
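As a rough modern counterpart, the DOMDocument class in PHP 5 and later exposes the same two pieces of information through its doctype property. The sketch below assumes that newer extension, and the weather document is a made-up sample that simply matches the names shown in the Dtd object above.

```php
<?php
// Assumption: PHP 5+ DOM extension, not the dtd() method described here.
// The weather document is invented for illustration.
$xml = <<<XML
<?xml version="1.0"?>
<!DOCTYPE weather SYSTEM "weather.dtd">
<weather><city>Bombay</city></weather>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);   // the external DTD file is not fetched by default

$doctype = $doc->doctype;       // DOMDocumentType, or null if there is no DOCTYPE
echo $doctype->name, "\n";      // weather
echo $doctype->systemId, "\n";  // weather.dtd
```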
DomElement Class
The PHP parser represents every element within the XML document as an instance of the DomElement class, which makes it one of the most important in this lineup. When you view the structure of a DomElement object, you see that it has two distinct properties that represent the element name and type, respectively. You'll remember from Listing 3.2 that these properties can be used to identify individual elements and extract their values. Here is an example:
DomElement Object ( [type] => 1 [tagname] => vegetable )
A special note should be made here of the type property, which indicates the type of node under discussion. This type property contains an integer value mapping to one of the parser's predefined node types. Table 3.1 lists the important types.
Table 3.1 DOM Node Types
Integer | Node type | Description
1 | XML_ELEMENT_NODE | Element
2 | XML_ATTRIBUTE_NODE | Attribute
3 | XML_TEXT_NODE | Text
4 | XML_CDATA_SECTION_NODE | CDATA section
5 | XML_ENTITY_REF_NODE | Entity reference
7 | XML_PI_NODE | Processing instruction
8 | XML_COMMENT_NODE | Comment
9 | XML_DOCUMENT_NODE | XML document
12 | XML_NOTATION_NODE | Notation
If you plan to use the type property within a script to identify node types (as I will be doing shortly in Listing 3.5), you should note that it is considered preferable to use the named constants rather than their corresponding integer values, both for readability and to ensure stability across API changes.
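To illustrate why the named constants read better than bare integers, here is a small sketch that labels each child of an element by node type. It assumes the DOM extension of PHP 7 or later (which defines the same XML_* constants but uses nodeType rather than type), and nodeTypeName() is a hypothetical helper.

```php
<?php
// Assumption: PHP 7+ DOM extension. nodeTypeName() is a hypothetical helper.
function nodeTypeName(DOMNode $node): string
{
    $names = [
        XML_ELEMENT_NODE       => 'XML_ELEMENT_NODE',
        XML_TEXT_NODE          => 'XML_TEXT_NODE',
        XML_CDATA_SECTION_NODE => 'XML_CDATA_SECTION_NODE',
        XML_PI_NODE            => 'XML_PI_NODE',
        XML_COMMENT_NODE       => 'XML_COMMENT_NODE',
    ];
    // Fall back to the raw integer for types not listed above
    return $names[$node->nodeType] ?? 'other (' . $node->nodeType . ')';
}

$doc = new DOMDocument();
$doc->loadXML('<a>text<!-- note --><b/></a>');

$types = [];
foreach ($doc->documentElement->childNodes as $child) {
    $types[] = nodeTypeName($child);
}
echo implode("\n", $types), "\n";
// XML_TEXT_NODE
// XML_COMMENT_NODE
// XML_ELEMENT_NODE
```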
The DomElement object also exposes a number of useful object methods:
children(): Returns an array of DomElement objects representing the children of this node
parent(): Returns a DomElement object representing the parent of this node
attributes(): Returns an array of DomAttribute objects representing the attributes of this node
get_attribute(): Returns the value of an attribute of this node
new_child(): Creates a new DomElement object, and attaches it as a child of this node (note that this newly created node is placed at the end of the existing child list)
set_attribute(): Sets the value of an attribute of this node
set_content(): Sets the content of this node
Again, the two most commonly used ones are the children() and attributes() methods, which return an array of DomElement and DomAttribute objects, respectively. The get_attribute() method can be used to return the value of a specific attribute of an element (refer to Listing 3.8 for an example), whereas the new_child(), set_attribute(), and set_content() methods are used when creating or modifying XML trees in memory, and are discussed in detail in the section entitled "Manipulating DOM Trees."
Note that PHP's DOM implementation does not currently offer any way of removing an attribute previously set with the set_attribute() method.
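This limitation applies to the extension described here. As a hedged aside, the DOMElement class in the DOM extension of PHP 5 and later does provide a removeAttribute() method; the sketch below assumes that newer extension.

```php
<?php
// Assumption: PHP 5+ DOM extension, not the API described in this chapter.
$doc = new DOMDocument();
$doc->loadXML("<vegetable color='green'>cabbages</vegetable>");
$vegetable = $doc->documentElement;

$vegetable->setAttribute('size', 'large');  // add a new attribute
$vegetable->removeAttribute('color');       // remove an existing one

$result = $doc->saveXML($vegetable);        // serialize just this element
echo $result, "\n";
// <vegetable size="large">cabbages</vegetable>
```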
Choices
Most of the object methods discussed in this chapter can also be invoked as functions by prefixing the method name with domxml_ and passing a reference to the object as the first function argument. The following snippets demonstrate this:
<?php
// these two are equivalent
$root1 = $doc->root();
$root2 = domxml_root($doc);

// these two are equivalent
$children1 = $root1->children();
$children2 = domxml_children($root2);
?>
Listing 3.5 demonstrates one of these in action by combining the children() method of a DomElement object with a recursive function and HTML's unordered lists to create a hierarchical tree mirroring the document structure (similar in concept, though not in approach, to Listing 2.5). At the end of the process, a count of the total number of elements encountered is displayed.
Listing 3.5 Representing an XML Document as a Hierarchical List
<?php
// XML file
$xml_file = "letter.xml";

// parse it
if (!$doc = xmldocfile($xml_file)) {
    die("Error in XML document");
}

// get the root node
$root = $doc->root();

// get its children
$children = get_children($root);

// element counter
// start with 1 so as to include document element
$elementCount = 1;

// start printing
print_tree($children);

// this recursive function accepts an array of nodes as argument,
// iterates through it and prints a list for each element found
function print_tree($nodeCollection)
{
    global $elementCount;

    // iterate through array
    echo "<ul>";
    for ($x=0; $x<sizeof($nodeCollection); $x++) {
        // add to element count
        $elementCount++;

        // print element as list item
        echo "<li>" . $nodeCollection[$x]->tagname;

        // go to the next level of the tree
        $nextCollection = get_children($nodeCollection[$x]);

        // recurse!
        print_tree($nextCollection);
    }
    echo "</ul>";
}

// function to return an array of children, given a parent node
function get_children($node)
{
    $temp = $node->children();
    $collection = array();

    // iterate through children array
    for ($x=0; $x<sizeof($temp); $x++) {
        // filter out all nodes except elements
        // and create a new array
        if ($temp[$x]->type == XML_ELEMENT_NODE) {
            $collection[] = $temp[$x];
        }
    }

    // return array containing child nodes
    return $collection;
}

echo "Total number of elements in document: $elementCount";
?>
Listing 3.5 is fairly easy to understand. The first step is to obtain a reference to the root of the document tree via the root() method; this reference serves as the starting point for the recursive print_tree() function. This function obtains a reference to the children of the root node, processes them, and then calls itself again to process the next level of nodes in the tree. The process continues until all the nodes in the tree have been exhausted. An element counter is used to track the number of elements found, and to display a total count of all the elements in the document.
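The same recursion can be sketched against the DOMDocument class of PHP 7 and later. This is a variation under stated assumptions, not a port of Listing 3.5: it starts at the document element itself, returns the markup as a string, and passes the counter by reference instead of using a global. The renderTree() name and the sample letter are invented for illustration.

```php
<?php
// Assumption: PHP 7+ DOM extension. renderTree() and the sample XML are
// hypothetical; letter.xml from Listing 3.5 is not reproduced here.
function renderTree(DOMElement $element, int &$count): string
{
    $count++;
    $html  = '<li>' . htmlspecialchars($element->tagName);
    $items = '';

    // Recurse into child elements only, skipping text and other node types
    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $items .= renderTree($child, $count);
        }
    }

    // Emit a nested list only when there are child elements
    if ($items !== '') {
        $html .= '<ul>' . $items . '</ul>';
    }
    return $html . '</li>';
}

$doc = new DOMDocument();
$doc->loadXML('<letter><to><name>Joe</name></to><body>Hello</body></letter>');

$count = 0;
$list  = '<ul>' . renderTree($doc->documentElement, $count) . '</ul>';

echo $list, "\n";
echo "Total number of elements in document: $count\n"; // 4
```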
DomText Class
Character data within an XML document is represented by the DomText class. Here's what it looks like:
DomText Object ( [type] => 3 [content] => cabbages )
The type property represents the node type (XML_TEXT_NODE in this case, as can be seen from Table 3.1), whereas the content property holds the character data itself. In order to illustrate this, consider Listing 3.6, which takes an XML-encoded list of country names, parses it, and puts that list into a PHP array.
Listing 3.6 Using DomText Object Properties to Retrieve Character Data from an XML Document
<?php
// XML data
$xml_string = "<?xml version='1.0'?>
<earth>
    <country>Albania</country>
    <country>Argentina</country>
    <!-- and so on -->
    <country>Zimbabwe</country>
</earth>";

// create array to hold country names
$countries = array();

// create a DOM object from the XML data
if (!$doc = xmldoc($xml_string)) {
    die("Error parsing XML");
}

// start at the root
$root = $doc->root();

// move down one level to the root's children
$nodes = $root->children();

// iterate through the list of children
foreach ($nodes as $n) {
    // for each <country> element
    // get the text node under it
    // and add it to the $countries[] array
    $text = $n->children();
    if ($text[0]->content != "") {
        $countries[] = $text[0]->content;
    }
}

// uncomment this line to see the contents of the array
// print_r($countries);
?>
Fairly simple: a loop is used to iterate through all the <country> elements, adding the character data found within each to the global $countries array.
Taking up Space
It's important to remember that XML, unlike HTML, does not ignore whitespace, but treats it as literal character data. Consequently, if your XML document includes whitespace or line breaks, PHP's DOM parser identifies them as text nodes, and creates DomText objects to represent them. This is a common cause of confusion for DOM newbies, who are often stumped by the "extra" nodes that appear in their DOM tree.
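The "extra" nodes are easy to see by counting children. The sketch below uses the DOMDocument class of PHP 5 and later (an assumption; it is not the extension described in this chapter), which also offers a preserveWhiteSpace switch for discarding whitespace-only text nodes at parse time.

```php
<?php
// Assumption: PHP 5+ DOM extension.
$xml = "<earth>\n  <country>Albania</country>\n  <country>Zimbabwe</country>\n</earth>";

// Default behavior: whitespace between elements becomes text nodes
$doc = new DOMDocument();
$doc->loadXML($xml);
$withWhitespace = $doc->documentElement->childNodes->length;
echo $withWhitespace, "\n";    // 5: three whitespace text nodes plus two elements

// Ask the parser to drop whitespace-only text nodes
$stripped = new DOMDocument();
$stripped->preserveWhiteSpace = false;   // must be set before loading
$stripped->loadXML($xml);
$withoutWhitespace = $stripped->documentElement->childNodes->length;
echo $withoutWhitespace, "\n"; // 2: just the two <country> elements
```

Filtering on the node type, as Listing 3.5 does, is the alternative when the parser cannot be told to skip blank nodes.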
DomAttribute Class
A call to the attributes() method of the DomElement object generates an array of DomAttribute objects, each of which looks like this:
DomAttribute Object ( [name] => color [value] => green )
The attribute name can be accessed via the name property, and the corresponding attribute value can be accessed via the value property. Listing 3.7 demonstrates how this works by using the value of the color attribute to highlight each vegetable or fruit name in the corresponding color.
Listing 3.7 Accessing Attribute Values with the DomAttribute Object
<?php
// XML data
$xml_string = "<?xml version='1.0'?>
<sentence>
What a wonderful profusion of colors and smells in the market -
<vegetable color='green'>cabbages</vegetable>,
<vegetable color='red'>tomatoes</vegetable>,
<fruit color='green'>apples</fruit>,
<vegetable color='purple'>aubergines</vegetable>,
<fruit color='yellow'>bananas</fruit>
</sentence>";

// parse it
if (!$doc = xmldoc($xml_string)) {
    die("Error in XML document");
}

// get the root node
$root = $doc->root();

// get its children
$children = $root->children();

// iterate through child list
for ($x=0; $x<sizeof($children); $x++) {
    // if element node
    if ($children[$x]->type == XML_ELEMENT_NODE) {
        // get the text node under it
        $text = $children[$x]->children();
        $cdata = $text[0]->content;

        // check its attributes to see if "color" is present
        $attributes = $children[$x]->attributes();
        if (is_array($attributes) && ($index = is_color_attribute_present($attributes))) {
            // if it is, colorize the element content
            echo "<font color=" . $index . ">" . $cdata . "</font>";
        }
        else {
            // else print it as is
            echo $cdata;
        }
    }
    // if text node
    else if ($children[$x]->type == XML_TEXT_NODE) {
        // simply print the content
        echo $children[$x]->content;
    }
}

// function to iterate through attribute list
// and return the value of the "color" attribute if available
function is_color_attribute_present($attributeList)
{
    // default to an empty value so the function always returns
    // a defined result, even when no "color" attribute exists
    $color = "";
    foreach ($attributeList as $attrib) {
        if ($attrib->name == "color") {
            $color = $attrib->value;
            break;
        }
    }
    return $color;
}
?>
There is, of course, a simpler way to do this: just use the DomElement object's get_attribute() method. Listing 3.8, which generates equivalent output to Listing 3.7, demonstrates this alternative (and much shorter) approach.
Listing 3.8 Accessing Attribute Values (a Simpler Approach)
<?php
// XML data
$xml_string = "<?xml version='1.0'?>
<sentence>
What a wonderful profusion of colors and smells in the market -
<vegetable color='green'>cabbages</vegetable>,
<vegetable color='red'>tomatoes</vegetable>,
<fruit color='green'>apples</fruit>,
<vegetable color='purple'>aubergines</vegetable>,
<fruit color='yellow'>bananas</fruit>
</sentence>";

// parse it
if (!$doc = xmldoc($xml_string)) {
    die("Error in XML document");
}

// get the root node
$root = $doc->root();

// get its children
$children = $root->children();

// iterate through child list
for ($x=0; $x<sizeof($children); $x++) {
    // if element node
    if ($children[$x]->type == XML_ELEMENT_NODE) {
        // get the text node under it
        $text = $children[$x]->children();
        $cdata = $text[0]->content;

        // check to see if element contains the "color" attribute
        if ($children[$x]->get_attribute("color")) {
            // "color" attribute is present, colorize text
            echo "<font color=" . $children[$x]->get_attribute("color") . ">" . $cdata . "</font>";
        }
        else {
            // otherwise just print the text as is
            echo $cdata;
        }
    }
    // if text node
    else if ($children[$x]->type == XML_TEXT_NODE) {
        // print content as is
        echo $children[$x]->content;
    }
}
?>
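For readers on later PHP releases, the attribute lookup can be sketched with the DOMElement methods hasAttribute() and getAttribute(). The snippet assumes the DOM extension of PHP 5 or later rather than the API in Listings 3.7 and 3.8, and it uses a shortened sample sentence of my own.

```php
<?php
// Assumption: PHP 5+ DOM extension; the sentence is a shortened, invented sample.
$doc = new DOMDocument();
$doc->loadXML("<sentence>Fresh <vegetable color='green'>cabbages</vegetable> and <fruit>figs</fruit></sentence>");

$out = '';
foreach ($doc->documentElement->childNodes as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE && $node->hasAttribute('color')) {
        // "color" attribute is present: wrap the element text in a <font> tag
        $out .= '<font color="' . htmlspecialchars($node->getAttribute('color')) . '">'
              . htmlspecialchars($node->textContent) . '</font>';
    } else {
        // text nodes, and elements without a color, are printed as is
        $out .= htmlspecialchars($node->textContent);
    }
}

echo $out, "\n";
// Fresh <font color="green">cabbages</font> and figs
```

Checking hasAttribute() first distinguishes a missing attribute from one that is present but empty, since getAttribute() returns an empty string in both cases.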
A Composite Example
Now that you know how it works, how about seeing how it plays out in real life? This example takes everything you learned thus far, and uses that knowledge to construct an HTML file from an XML document.
I'll be using a variant of the XML invoice (Listing 2.21) from Chapter 2, adapting the SAX-based approach demonstrated there to the new DOM paradigm. As you'll see, although the two techniques are fundamentally different, they can nonetheless achieve a similar effect. Listing 3.9 is the marked-up invoice.
Listing 3.9 An XML Invoice (invoice.xml)
<?xml version="1.0"?>
<invoice>
    <customer>
        <name>Joe Wannabe</name>
        <address>
            <line>23, Great Bridge Road</line>
            <line>Bombay, MH</line>
            <line>India</line>
        </address>
    </customer>
    <date>2001-09-15</date>
    <reference>75-848478-98</reference>
    <items>
        <item cid="AS633225">
            <desc>Oversize tennis racquet</desc>
            <price>235.00</price>
            <quantity>1</quantity>
            <subtotal>235.00</subtotal>
        </item>
        <item cid="GT645">
            <desc>Championship tennis balls (can)</desc>
            <price>9.99</price>
            <quantity>4</quantity>
            <subtotal>39.96</subtotal>
        </item>
        <item cid="U73472">
            <desc>Designer gym bag</desc>
            <price>139.99</price>
            <quantity>1</quantity>
            <subtotal>139.99</subtotal>
        </item>
        <item cid="AD848383">
            <desc>Custom-fitted sneakers</desc>
            <price>349.99</price>
            <quantity>1</quantity>
            <subtotal>349.99</subtotal>
        </item>
    </items>
    <delivery>Next-day air</delivery>
</invoice>
Listing 3.10 parses the previous XML data to create an HTML page, suitable for printing or viewing in a browser.
Listing 3.10 Formatting an XML Document with the DOM
<html>
<head>
<basefont face="Arial">
</head>
<body bgcolor="white">
<font size="+3">Sammy's Sports Store</font>
<br>
<font size="-2">14, Ocean View, CA 12345, USA http://www.sammysportstore.com/</font>
<p>
<hr>
<center>INVOICE</center>
<hr>
<?php
// arrays to associate XML elements with HTML output
$startTagsArray = array(
    'CUSTOMER' => '<p> <b>Customer: </b>',
    'ADDRESS' => '<p> <b>Billing address: </b>',
    'DATE' => '<p> <b>Invoice date: </b>',
    'REFERENCE' => '<p> <b>Invoice number: </b>',
    'ITEMS' => '<p> <b>Details: </b> <table width="100%" border="1" cellspacing="0" cellpadding="3">'
        . '<tr><td><b>Item description</b></td><td><b>Price</b></td>'
        . '<td><b>Quantity</b></td><td><b>Sub-total</b></td></tr>',
    'ITEM' => '<tr>',
    'DESC' => '<td>',
    'PRICE' => '<td>',
    'QUANTITY' => '<td>',
    'SUBTOTAL' => '<td>',
    'DELIVERY' => '<p> <b>Shipping option:</b> ',
    'TERMS' => '<p> <b>Terms and conditions: </b> <ul>',
    'TERM' => '<li>'
);

$endTagsArray = array(
    'LINE' => ',',
    'ITEMS' => '</table>',
    'ITEM' => '</tr>',
    'DESC' => '</td>',
    'PRICE' => '</td>',
    'QUANTITY' => '</td>',
    'SUBTOTAL' => '</td>',
    'TERMS' => '</ul>',
    'TERM' => '</li>'
);

// array to hold sub-totals
$subTotals = array();

// XML file
$xml_file = "/home/sammy/invoices/invoice.xml";

// parse document
$doc = xmldocfile($xml_file);

// get the root node
$root = $doc->root();

// get its children
$children = $root->children();

// start printing
print_tree($children);

// this recursive function accepts an array of nodes as argument,
// iterates through it and:
// - marks up elements with HTML
// - prints text as is
function print_tree($nodeCollection)
{
    global $startTagsArray, $endTagsArray, $subTotals;

    foreach ($nodeCollection as $node) {
        // how to handle elements
        if ($node->type == XML_ELEMENT_NODE) {
            // print HTML opening tags
            echo $startTagsArray[strtoupper($node->tagname)];

            // recurse
            $nextCollection = $node->children();
            print_tree($nextCollection);

            // once done, print closing tags
            echo $endTagsArray[strtoupper($node->tagname)];
        }

        // how to handle text nodes
        if ($node->type == XML_TEXT_NODE) {
            // print text as is
            echo($node->content);
        }

        // PI handling code would come here
        // this doesn't work too well in PHP 4.1.1
        // see the sidebar entitled "Process Failure"
        // for more information
    }
}

// this function gets the character data within an element
// it accepts an element node as argument
// and dives one level deeper into the DOM tree
// to retrieve the corresponding character data
function getNodeContent($node)
{
    $content = "";
    $children = $node->children();
    if ($children) {
        foreach ($children as $child) {
            $content .= $child->content;
        }
    }
    return $content;
}
?>
</body>
</html>
Figure 3.2 shows what the output looks like.
Figure 3.2 Sammy's Sports Store invoice.
As with the SAX example (refer to Listing 2.23), the first thing to do is define arrays to hold the HTML markup for specific tags; in Listing 3.10, this markup is stored in the $startTagsArray and $endTagsArray variables.
Next, the XML document is read by the parser, and an appropriate DOM tree is generated in memory. An array of objects representing the first level of the tree (the children of the root node) is obtained, and the function print_tree() is called. This print_tree() function is a recursive function, and it forms the core of the script.
The print_tree() function accepts a node list as argument, and iterates through this list, examining each node and processing it appropriately. As you can see, the function is set up to perform specific tasks, depending on the type of node:
If the node is an element, the function looks up the $startTagsArray and $endTagsArray variables, and prints the corresponding HTML markup.
If the node is a text node, the function simply prints the contents of the text node as is.
Additionally, if the node is an element, the print_tree() function obtains a list of the element's children, if any exist, and proceeds to call itself with that node list as argument. And so the process repeats itself until the entire tree has been parsed.
As Listing 3.10 demonstrates, this technique provides a handy way to recursively scan through a DOM tree and perform different actions based on the type of node encountered. You can use this technique to count, classify, and process the different types of elements encountered (Listing 3.5 demonstrated a primitive element counter); or even construct a new tree from the existing one.
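The dispatch-on-node-type pattern carries over directly to later PHP releases. The following sketch assumes the DOM extension of PHP 7 or later; the tag-to-markup maps are trimmed-down versions of those in Listing 3.10, and renderNode() is a hypothetical helper that returns the markup rather than printing it.

```php
<?php
// Assumption: PHP 7+ DOM extension. renderNode() is a hypothetical helper,
// and the maps below cover only a few of the invoice elements.
$startTags = ['ITEMS' => '<table>', 'ITEM' => '<tr>', 'DESC' => '<td>', 'PRICE' => '<td>'];
$endTags   = ['ITEMS' => '</table>', 'ITEM' => '</tr>', 'DESC' => '</td>', 'PRICE' => '</td>'];

function renderNode(DOMNode $node, array $startTags, array $endTags): string
{
    // text nodes: print the (escaped) text as is
    if ($node->nodeType === XML_TEXT_NODE) {
        return htmlspecialchars($node->nodeValue);
    }
    // anything other than elements and text is ignored in this sketch
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return '';
    }

    // elements: opening markup, then children, then closing markup
    $key  = strtoupper($node->nodeName);
    $html = $startTags[$key] ?? '';
    foreach ($node->childNodes as $child) {
        $html .= renderNode($child, $startTags, $endTags);
    }
    return $html . ($endTags[$key] ?? '');
}

$doc = new DOMDocument();
$doc->loadXML('<items><item><desc>Designer gym bag</desc><price>139.99</price></item></items>');

$html = renderNode($doc->documentElement, $startTags, $endTags);
echo $html, "\n";
// <table><tr><td>Designer gym bag</td><td>139.99</td></tr></table>
```

Passing the maps as arguments instead of declaring them global keeps the function easy to test in isolation.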
Process Failure
If you've been paying attention, you will have noticed that the XML invoice in Listing 3.9 is not exactly the same as the one shown in Listing 2.21. Listing 2.21 included an additional processing instruction (PI), a call to the PHP function displayTotal(), which is missing in Listing 3.9.
Why? Because the DOM extension that ships with PHP 4.1.1 has trouble with processing instructions, and tends to barf all over the screen when it encounters one. Later (beta) versions of the extension do, however, include a fix for the problem.
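For what it's worth, the DOM extension in PHP 5 and later represents processing instructions as ordinary nodes with a target and a data property. The sketch below assumes that newer extension; the instruction in the sample only imitates the displayTotal() call mentioned above, since the exact form used in Listing 2.21 is not reproduced here.

```php
<?php
// Assumption: PHP 5+ DOM extension. The processing instruction in this
// sample is a guess at the general shape of the one in Listing 2.21.
$xml = '<invoice><delivery>Next-day air</delivery><?php displayTotal(); ?></invoice>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$instructions = [];
foreach ($doc->documentElement->childNodes as $node) {
    if ($node->nodeType === XML_PI_NODE) {
        // target is the name after "<?", data is the text that follows it
        $instructions[] = [$node->target, trim($node->data)];
    }
}

foreach ($instructions as [$target, $data]) {
    echo "PI target: $target, data: $data\n";
}
// PI target: php, data: displayTotal();
```

The script only reports the instruction; deciding whether and how to act on its data is left to the application.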