PHP and the Document Object Model (DOM)
"A fool sees not the same tree that a wise man sees."
~William Blake
If you've been paying attention, you now know the basics of parsing XML with PHP. As Chapter 2, "PHP and the Simple API for XML (SAX)," demonstrated, it's pretty simple: whip up some XML data and mix in a few callback functions. It's a simple yet effective recipe, and one that can be used to great effect for the rapid development of XML-based applications.
That said, although the event-driven approach to XML parsing is certainly popular, it's not the only option available. PHP also allows you to parse XML using the Document Object Model (DOM), an alternative technique that allows developers to create and manipulate a hierarchical tree representation of XML data for greater flexibility and ease of use.
In this chapter, this tree-based approach is explored in greater detail. First it is put under the microscope to see exactly how it works and then PHP's implementation of the DOM is introduced. The various methods exposed by PHP to simplify interaction with the DOM are also examined, together with examples and code listings that demonstrate its capabilities.
Both tree- and event-based approaches have significant advantages and disadvantages, and these can impact your choice of technique when implementing specific projects. To that end, this chapter also includes a brief discussion of the pros and cons of each approach in the hope that it will assist you in making the right choice for a particular project.
Let's get started!
Document Object Model (DOM)
The Document Object Model (DOM) is a standard interface to access and manipulate structured data.
As the name suggests, it does this by modeling, or representing, a document as a hierarchical tree of objects.
The W3C's DOM specification defines a number of different objects to represent the different structures that appear within an XML document. For example, elements are represented by an Element object, whereas attributes are represented by Attr objects.
Each of these different object types exposes specific methods and properties. Element objects expose a tagName property containing the element name and getAttribute() and setAttribute() methods for attribute manipulation, whereas Attr objects expose a value property containing the value of the particular attribute. These methods and properties can be used by the application layer to navigate and process the DOM tree, exploit the relationships between the different branches of the tree, and extract information from it.
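To make this concrete, here is a minimal sketch of those properties and methods in PHP. It assumes the DOM extension bundled with PHP 5 and later (the DOMDocument, DOMElement, and DOMAttr classes), which may not match the implementation discussed in the next section. The one-element document is made up for illustration.

```php
<?php
// Assumption: PHP's bundled DOM extension (DOMDocument and friends).
$doc = new DOMDocument();
$doc->loadXML("<vegetable color='green'>cabbages</vegetable>");

// The root of the tree is an Element object.
$element = $doc->documentElement;

// tagName holds the element name; getAttribute() reads an attribute.
$name  = $element->tagName;
$color = $element->getAttribute('color');

// setAttribute() changes (or creates) an attribute on the element.
$element->setAttribute('color', 'purple');

// The attribute itself is an Attr object with a value property.
$attr  = $element->getAttributeNode('color');
$value = $attr->value;

echo "$name: $color -> $value\n"; // prints "vegetable: green -> purple"
```

Note that the element and its attribute are separate objects: the Attr object can be fetched and inspected on its own, independently of the Element it belongs to.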
The very first specification of the DOM (DOM Level 1) appeared on the W3C's web site in October 1998, and simply specified the "core" features of the DOM: the basic objects and the interfaces to them. The next major upgrade, DOM Level 2, appeared in November 2000; it examined the DOM from the perspective of core functions, event handling, and document traversal. DOM Level 3, which is currently under development, builds on past work, and incorporates additions and changes from other related technologies (XPath, abstract schemas, and so on).
As a standard interface to structured data, the DOM was designed from the get-go to be platform- and language-independent. It can be (and is) used to represent structured HTML and XML data, with DOM (or DOM-based) implementations currently available for Java, JavaScript, Python, C/C++, Visual Basic, Delphi, Perl, SMIL, SVG, and PHP. (The PHP implementation is discussed in detail in the next section.)
In order to better understand how the DOM works, consider Listing 3.1.
Listing 3.1 A Simple XML Document
<?xml version="1.0"?>
<sentence>What a wonderful profusion of colors and smells in the market
<vegetable color='green'>cabbages</vegetable>, <vegetable
color='red'>tomatoes</vegetable>, <fruit color='green'>apples</fruit>,
<vegetable color='purple'>aubergines</vegetable>, <fruit
color='yellow'>bananas</fruit></sentence>
Once a DOM parser has chewed on this document, it spits out the tree structure shown in Figure 3.1.
Figure 3.1 A DOM tree.
As you can see, the parser returns a tree containing multiple nodes linked to each other by parent-child relationships. Developers can then write code to move around the tree, access node properties, and manipulate node content.
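The parent-child structure can also be seen in code. The following sketch loads the document from Listing 3.1 and walks the resulting tree recursively, recording one line per node. It assumes PHP 7 or later with the bundled DOM extension (DOMDocument); the describeTree() helper is a hypothetical name invented for this example, and whitespace-only text nodes are skipped to keep the output short.

```php
<?php
// Assumption: PHP 7+ with the bundled DOM extension (DOMDocument).
$xml = <<<'XML'
<?xml version="1.0"?>
<sentence>What a wonderful profusion of colors and smells in the market
<vegetable color='green'>cabbages</vegetable>, <vegetable
color='red'>tomatoes</vegetable>, <fruit color='green'>apples</fruit>,
<vegetable color='purple'>aubergines</vegetable>, <fruit
color='yellow'>bananas</fruit></sentence>
XML;

// Hypothetical helper: returns an indented description of $node
// and everything below it, one array entry per node.
function describeTree(DOMNode $node, int $depth = 0): array
{
    $indent = str_repeat('  ', $depth);

    if ($node instanceof DOMText) {
        // Text nodes are leaves; skip the ones that are only whitespace.
        $text = trim($node->nodeValue);
        return $text === '' ? [] : ["{$indent}text \"{$text}\""];
    }

    if (!$node instanceof DOMElement) {
        return []; // ignore comments, processing instructions, and so on
    }

    $label = "element <{$node->tagName}>";
    foreach ($node->attributes as $attr) {
        $label .= " {$attr->name}=\"{$attr->value}\"";
    }

    $lines = [$indent . $label];
    foreach ($node->childNodes as $child) {
        // Each child sits one level deeper in the tree.
        $lines = array_merge($lines, describeTree($child, $depth + 1));
    }
    return $lines;
}

$doc = new DOMDocument();
$doc->loadXML($xml);

$lines = describeTree($doc->documentElement);
echo implode("\n", $lines), "\n";
```

The first line of output describes the sentence element; its children (text nodes and the vegetable and fruit elements) appear indented beneath it, and each element's own text appears one level deeper still.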
This approach is in stark contrast to the event-driven approach you studied in Chapter 2, "PHP and the Simple API for XML (SAX)." A SAX parser progresses sequentially through a document, firing events based on the tags it encounters and leaving it to the application layer to decide how to process each event. A DOM parser, on the other hand, reads the entire document into memory and builds a tree representation of its structure; the application layer can then use standard DOM interfaces to find and manipulate individual nodes on this tree, in a non-sequential manner.
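The difference is easiest to see side by side. The sketch below runs the same small document through both approaches. It assumes a PHP installation with both the XML parser extension (xml_parser_create() and related functions) and the DOM extension (DOMDocument) enabled; the shortened document and the variable names are invented for this example.

```php
<?php
// Assumption: PHP with both the xml and dom extensions enabled.
$xml = "<sentence><vegetable color='green'>cabbages</vegetable>, "
     . "<fruit color='yellow'>bananas</fruit></sentence>";

// Event-driven: handlers fire strictly in document order, and the
// application has to record anything it wants to remember.
$events = [];
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start:$name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "end:$name";
    }
);
xml_parse($parser, $xml, true);

// Tree-based: the whole document is in memory, so nodes can be
// visited in any order, here the last element first.
$doc = new DOMDocument();
$doc->loadXML($xml);

$fruit     = $doc->getElementsByTagName('fruit')->item(0);
$vegetable = $doc->getElementsByTagName('vegetable')->item(0);

echo implode(' ', $events), "\n";
echo $fruit->textContent, ' then ', $vegetable->textContent, "\n";
```

With the event-driven parser, the order of the entries in $events is dictated by the document. With the tree, the code decides which node to look at first, at the cost of holding the entire document in memory.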