PHP and XML
Note: If you're already familiar with the basic concepts of XML, you can safely skip the next sections giving a short introduction to XML, and continue directly with PHP and Expat.
What Is XML?
XML (Extensible Markup Language) is a meta markup language for documents containing structured information. Let's try to explain it word by word in plain English:
-
XML is extensible. Take HTML: the tag <h1> always denotes a first-level heading. In XML, by contrast, the tag means nothing until you give it a meaning with an accompanying rule, the Document Type Definition (DTD).
-
XML is a markup language. Just as HTML should, theoretically, XML does not provide layout information to the processing application.
-
XML is a meta language. XML doesn't have a fixed tag set; it provides a facility to define tags.
-
XML works with documents. Documents. As in not limited to files! Documents can come from a database, over the network, or indeed from files.
-
XML defines structured information. It arranges single parts of data in a larger body and gives it a contextual meaning and a structural relationship.
Structured Information
There's one key concept you'll need to understand when talking about XML: structured documents or, more eloquently, structured information markup. Structured markup explicitly defines the structure and semantic content (the contextual meaning) of a document. It doesn't influence the way in which the document will appear to the reader; the interpretation of the data (parsing, layout, etc.) is left completely to the processing application. Take the HTML <p> (paragraph) tag: It denotes multiple sentences belonging together to form a logical unit. The tag per se doesn't imply how the paragraph should be rendered in the browser; the browser could insert a blank line before or after, indent the first line in the paragraph, or add ornamental borders around it. This is logical markup; the style information is hardcoded into the browser. XML documents are composed of such logical markup. As in HTML, tags are used to identify the markup information. But in XML, there are no visual elements as in HTML (think of <font>); it's restricted to logical markup. There's no way to specify a word as italic in XML. You can only mark it for its semantic meaning, for example with <emphasis>.
So much for the markup, but where's the structure? XML tags can be nested and have a contextual state; that is, it's important where they appear in a document. A tag combination <chapter><title> is treated differently than <book><title>. The XML specification places no limit on the number of nested elements; the only requirement is that all elements must originate in one root element.
XML's Relatives
The ancestor of XML is SGML. Since it became an ISO standard in 1986, SGML (Standard Generalized Markup Language) has been used to maintain structured documents by large corporations in all industries. However, SGML is a complex standard that's difficult to support in applications. Most SGML applications (editors, storage servers, transformation tools) are therefore quite expensive, often costing well above $10,000.
HTML, on the other hand, has wide industry support and is used on millions of Web sites. It defines a simple type of document for a common class of short articles, with headings, paragraphs, lists, illustrations, and some provision for hypertext and multimedia. But it's very limited regarding flexibility and extensibility. The tags and semantics are fixed; you can't define your own tag for an entry in a Table of Contents. Neither is it suited for media other than computer interfaces; if you've ever tried to print articles distributed over multiple files, you know what this means. The very open specification led to fragmentation into multiple different implementations. As you know, it's an art per se to write browser-neutral HTML.
So there was a need to create a new format allowing structured documents to be used over the Web. XML was designed to overcome the limitations in the only viable alternatives, SGML and HTML.
The design goals of XML had some clearly defined points:
-
It must be easy to use, both for users and for developers implementing XML parsers. The complexity of SGML is a constraint that needs to be removed.
-
XML must be open to support a wide variety of applications and subprotocols. The dependence on a single, inflexible document type as with HTML needed to be eliminated.
-
It requires a strict syntax. Optional features lead to compatibility problems when users want to share documents. There was the constant fear that the same could happen as happened with HTML: multiple competing and incompatible implementations.
-
It must be compatible with SGML. Members of the development committee were also involved in SGML efforts and had legacy data contained in SGML systems.
The development resulted in a clear specification approved by the World Wide Web Consortium (W3C) as the recommendation Extensible Markup Language (XML) 1.0 from February 10, 1998.
XML is different from SGML: XML strips out a large number of SGML's more complex and less-used features and creates a new reduced SGML-based application. Because it's a subset of SGML, you can read an XML document with any SGML-compliant system. Every valid XML document is a valid SGML document.
XML is different from HTML: Apart from removing HTML misconceptions, it has important syntactical differences. Plus, XML is fully Unicode-ready; tags, attributes and contents can be in any string encoding defined by Unicode.
Let's look at a short excerpt from the source code of this book:
<title>Cutting-Edge Applications</title>
<abstract>
  <para>
    If you realize that all things change, there is nothing you will try to hold on to.
  </para>
</abstract>
Here you see the tags in use, providing for structured and logical markup. In contrast to HTML,
-
Tags are case sensitive.
-
Whitespace is significant.
-
Opening tags must always have a matching closing tag or be self-closing (for example, <xref/>).
-
Documents can have an arbitrary valid Document Type Definition.
Thus we can happily summarize: XML removes the enormous complexity of SGML, while still providing all necessary features for structural markup, including the definition of custom document types.
XML's Advantages
But why XML? With all those formal definitions and fact sheets, developers sometimes don't see the usefulness for their daily activities at first. Indeed, why use XML and not Word or Notes? Or your own proprietary storage format? Or a relational database?
The main argument against proprietary formats is just that: They're proprietary. Data that's designed to be used on a heterogeneous network such as the Internet has to be usable on all types of computers connected to it. XML is built out of plain text (as opposed to the binary format of most proprietary applications), making it supportable by all current computing platforms. Besides, proprietary data formats are often (for example in public bodies) just not an option: You don't want to rely on the mercy of a single vendor who could change the format at will, or even abandon it altogether. XML is license-free, vendor-neutral, and platform-independent.
While XML provides means for structured content, it presents a different (but not necessarily opposing) view on content than relational database systems. XML doesn't provide a relational model. It allows unlimited nested levels, which could not be handled by a database system. On the other hand, it misses features found in an RDBMS, such as stringent field types, constraints, keys, and so on. Of course, there are similarities in the two concepts and there is indeed development going on to create a SQL-like query language for XML documents. Anyway, the success of XML shouldn't make you forget the usefulness of the traditional RDBMS; they provide many important processing features that could hardly be modeled in XML, and they're optimized for speed from the ground up.
The overall and killer advantage of XML is the separation of logical structure from layout. By having your documents in XML, you can transform them into any representation you want: HTML, PostScript, PDF, RTF, plain text, audio, Braille, all from one single source. And as XML (plain text) documents can be parsed with your favorite scripting language, it's easy to change hyperlinks dynamically, change element contents, or associate structures with a database.
And if you're still not convinced, review all those Document Type Definitions that are being developed or are already in use. XML itself is mostly an "under the hood" technology; the meat is the applications that use XML.
What Is XML Used For?
As a structured information markup language, XML is of course used in content management systems, archiving solutions, and corporate document repositories. But plenty of other XML applications and subprotocols exist. Due to the open nature of the standard, DTDs have been developed at a fast pace.
DocBook
The DocBookX DTD is a very popular set of tags for describing books, articles, and other prose documents, particularly technical documentation. It was originally developed in 1991 by the publisher O'Reilly as an SGML DTD for in-house use. It soon became popular with authors and spread to other publishing houses, a change embraced by O'Reilly, which handed over further development to the Davenport Group. In mid-1998, OASIS (Organization for the Advancement of Structured Information Standards) officially took over the maintenance of DocBook. When XML became increasingly popular, an unofficial XML version (3.1) was created by Norman Walsh; work is currently underway to transform this to an official release, and DocBook 5 will most probably come in SGML and XML flavors.
When we started writing this book, it was clear that we wanted to use an open format such as XML. The DocBook DTD was consequently chosen because it offered all the features we would ever need. All the elements typically used in technical writing are present and, to tell you the truth, even very esoteric ones are included; or have you ever seen a MouseButton element (from the quick reference: "The conventional name of a mouse button") in your word processor?
XML and DocBook offer some clear advantages to us. We can use CVS as a version control tool for both the PHP examples and the book files. Transformation to HTML is easy, either with PHP or using a style sheet processor like James Clark's XT. And editing is very comfortable, thanks to SoftQuad's XMetaL, which allows intuitive visual editing by using Cascading Style Sheets (CSS) for the display in the authoring environment, as shown in Figure 7.3.
Figure 7.3. SoftQuad's XMetaL XML authoring environment, used for writing this book.
WML: Wireless Markup Language
WML is another Document Type Definition that has quickly become an industry standard. It's intended for use in specifying content and user interfaces for wireless devices such as mobile phones or Personal Digital Assistants. These devices have some common constraints, which make HTML a bad choice for a markup language:
-
Small and low-resolution graphical displays
-
Limited user interaction
-
Narrow-band network access (for now)
-
Limited computational resources
WML addresses these issues. It divides content into small pieces ("cards") and organizes them in larger information units ("decks"). To avoid continuous network access, WML defines a set of client-side scripting procedures in XML, for example the ability to set and access variables on the client computer. Because of limited screen real estate, creating meaningful navigation paths is especially difficult on portable devices. WML explicitly requires the user agent (the WAP browser) to have a navigation history and enables WML documents to make use of it, thus freeing the author from some of the responsibility and delegating it to the user agent.
RDF: Resource Description Framework
The RDF specification defines a language to store meta information about Web resources in an XML format. The Web as it is, with its millions of HTML pages, is very difficult to process by automated machines like spiders or robots. Search engines are hitting their limit every day, and even the most clever algorithms don't guarantee meaningful search results, as anyone using the Web for professional research knows. Web pages can only be full-text searched, which is a very limited searching method.
Current HTML allows primitive storing of meta data about a document. As you may know, meta tags can be used to denote keywords for a document, a short summary, and author information. But what if you want to store the publication history of the document? Information about the editors? Any bibliographer will laugh at HTML's meta tags.
In 1998, the W3C formed a committee to research a format for defining meta data and released the Resource Description Framework (RDF) as a recommendation on February 22, 1999.
RDF extends the format originally used for PICS, a content rating system, and is more and more replacing the Dublin Core Metadata for Resource Discovery standard, another methodology for classifying meta data. RDF has quickly become accepted as a standard mechanism for the global exchange of meta data on the Internet.
XML Documents
XML documents consist of markup and content (called character data in XML terms) in the Unicode character set. There are different types of markup, which we'll introduce in the following overview.
Elements
Elements will look familiar to anyone who has worked with HTML. They denote the meaning of a content section. XML cannot contain elements with no closing tag (HTML's <img>, for example), but has a distinct notation to identify empty tags:
<xref linkend="end"/>
Keep in mind that the nesting of tags is significant; improperly nested tags will lead to badly formed documents.
Attributes
Elements can have attributes. Attributes are name/value pairs that occur within the tags after the element name and specify a property of an element. Attribute values must be contained in quotes. No attribute name may appear more than once in the same tag.
Any XML document can optionally (and regardless of the Document Type Definition) have two standard attributes: xml:lang and xml:space. The xml:lang attribute was defined because language independence is one of XML's most important goals.
Without knowing what language a text is written in, it's impossible for an application to display, spell-check, or index it. XML's great Unicode support wouldn't be of any help if the author couldn't assign a language tag to a particular part of a document. Thus the xml:lang attribute was introduced:
<p>Worldwide declarations of love</p>
<p xml:lang="It">Ti amo.</p>
<p xml:lang="De">Ich liebe Dich.</p>
<p xml:lang="X-Klingon">qabang</p>
The language identifier is one of the following:
-
A two-letter ISO 639 language code
-
A language code registered with the Internet Assigned Numbers Authority (IANA); these begin with the prefix "i-" (or "I-")
-
A user-defined code, prefixed with "x-" (or "X-")
The other standard attribute, xml:space, isn't as straightforward to understand and use. As mentioned earlier, whitespace is significant in XML; it will be passed to the processing application. But after having read our Coding Style guidelines, you know that whitespace is important to structure and indent code to improve readability. This way it's used for laying out the markup, but it's of no importance for the markup itself or for the character data. On the other hand, an author may well intend whitespace to be preserved.
Because there are these two conflicting views on the subject, the XML committee introduced the xml:space attribute that controls the behavior of whitespace. It can only take two values: preserve or default. On any element that includes the attribute xml:space="preserve", whitespace is treated as "significant" and passed to the processing application as is. The default value tells the application that the application's default processing should be applied. Both standard attributes are inherited by sub-elements until they are explicitly reset in an element.
Note: An XML processor is the program used to read XML documents. The XML processor makes it possible for an application to access the structure and content of an XML document. Throughout this book, the terms XML processor and XML parser refer to the same kind of software.
Processing Instructions
Another "element" type you'll find in XML documents is the processing instruction, or PI. PIs are used to define parts in a document that should not be interpreted by the regular parser engine but instead by a specialized processing handler. They consist of <? and a target name used to identify the application to which the instruction is directed. The long PHP tag (<?php) is of course such a PI and can be used in XML documents to mark PHP code.
Note: In order to be XML-compliant, you have to set the short_open_tag directive in your PHP configuration to Off and use the long opening tag <?php consistently. The short opening tag would confuse an XML parser, as it wouldn't be a valid processing instruction. On the other hand, constructs like <?xml would interfere with PHP, as PHP would treat the xml as code and produce a parse error accordingly.
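To make the idea of a target name more concrete, here is a minimal sketch that collects the processing instructions of a small document with PHP's XML extension. The sample document, the formatter target, and the $found array are assumptions made up for this illustration; the handler signature (parser, target, data) is the one PHP passes to a callback registered with xml_set_processing_instruction_handler().

```php
<?php
// Sketch only: the document and the "formatter" target are invented examples.
$xml = '<?xml version="1.0"?>'
     . '<page><?php print "Hello"; ?><?formatter break-page?></page>';

$found = [];

$parser = xml_parser_create();

// The callback receives the parser, the PI target, and the PI data.
xml_set_processing_instruction_handler(
    $parser,
    function ($parser, $target, $data) use (&$found) {
        $found[] = [$target, trim($data)];
    }
);

if (!xml_parse($parser, $xml, true)) {
    die(sprintf(
        "XML error: %s at line %d\n",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    ));
}

// Show which application each instruction is addressed to.
foreach ($found as $pi) {
    print "target: {$pi[0]}, data: {$pi[1]}\n";
}
```

Note that the XML declaration at the top of the document is not reported as a processing instruction, so only the two instructions inside the page element end up in $found.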
Entities
Any text that's not markup constitutes the character data of the document. Within this content, an author needs a way to include special characters like < or > that normally would introduce start or end markup sections. Similarly to HTML, XML knows the notation of entities. Five entities are predefined:
Entity | Character | Description |
---|---|---|
&lt; | < | less than |
&gt; | > | greater than |
&amp; | & | ampersand |
&quot; | " | double quote |
&apos; | ' | single quote (apostrophe) |
Note: If you use a Document Type Definition, these entities need to be declared if you want to use them.
Using character references, you can insert any arbitrary Unicode character into your document. They consist of the normal notation of references, but with a pound sign (#) following the ampersand. After that, either a decimal or a hexadecimal reference to the Unicode position is inserted. For example, both &#8734; and &#x221E; refer to the infinity sign (∞). Entities are not limited to a single character, though; they can be of any length. For example, a DTD could define an entity &footer; to contain "Copyright (c) 2000 New Riders."
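The following sketch shows the predefined entities and character references in action. It assumes a PHP version in which htmlspecialchars() supports the ENT_XML1 flag, and the sample string and the sample element are invented for the illustration. The text is escaped first, then fed back through the parser, which should hand the original characters to the character data handler.

```php
<?php
// Sketch only: the sample text and element name are assumptions.
$text = "<a href='x'>Tom & \"Jerry\"</a>";

// With ENT_QUOTES and ENT_XML1, the five special characters are
// replaced by the five predefined entities.
$escaped = htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
print $escaped . "\n";

// Embed the escaped text plus a decimal and a hexadecimal character
// reference to the infinity sign in a tiny document.
$xml = "<sample>$escaped &#8734; &#x221E;</sample>";

$content = '';
$parser = xml_parser_create('UTF-8');
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$content) {
        $content .= $data;   // data may arrive in several chunks
    }
);

if (!xml_parse($parser, $xml, true)) {
    die("XML error: " . xml_error_string(xml_get_error_code($parser)) . "\n");
}

// The parser has resolved all references back to plain characters.
print $content . "\n";
```

Accumulating the chunks in $content, rather than assuming a single callback, is a deliberate choice: the parser is free to split character data at entity boundaries.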
Comments
XML uses the same notation for comments as HTML: <!--comment-->. Comments can contain any data except the literal string -- and can be placed between markup entries anywhere in your document. The XML specification explicitly states that comments are not part of a document's contents; a parser is not required to pass them to the processing application. This means you can't use comments for hidden instructions or the like, as you might be used to doing from HTML (think of using comment tags for hiding JavaScript from older browsers).
CDATA Sections
One special type of content is CDATA sections. As soon as you try to embed larger sections of code (containing many occurrences of < or &) into an XML document, you'll find the standard method of referencing special characters through entities awkward. HTML has the <pre> tag to turn off markup interpretation for a section, but as XML doesn't know any built-in tags, that's out of our reach. To overcome this, you can mark sections in XML as CDATA, using this construct:
<![CDATA[
print("<a href=\"script.php3?foo=bar&baz=foobar\">");
]]>
Within a CDATA section, all characters can occur, except for the ]]> sequence.
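As a quick check of this behavior, the sketch below wraps a line of PHP code in a CDATA section and parses it with PHP's XML extension. The example element and the $code variable are assumptions for the illustration; the point is that the unescaped < and & characters reach the character data handler untouched.

```php
<?php
// Sketch only: the <example> element is an invented wrapper.
$xml = <<<'XML'
<example><![CDATA[print("<a href=\"script.php3?foo=bar&baz=foobar\">");]]></example>
XML;

$code = '';
$parser = xml_parser_create('UTF-8');

// The content of a CDATA section is reported as ordinary character data.
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$code) {
        $code .= $data;
    }
);

if (!xml_parse($parser, $xml, true)) {
    die("XML error: " . xml_error_string(xml_get_error_code($parser)) . "\n");
}

print $code . "\n";
```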
Document Prologue
Note: Although prolog is the spelling in the official specs, our editor prefers the Americanized (and possibly arcane) spelling prologue.
XML documents should (but don't have to) begin with an XML declaration that specifies the version of XML being used. This version information is part of the document prologue:
<?xml version="1.0"?>
<greeting>Hello, world! </greeting>
By having this information at the top of a document, a processor can decide whether it can handle the document's version of XML. It's also useful as a method to identify the document's type; just as #!/bin/sh in the head of a file declares it to be a shell script, the XML declaration identifies an XML document.
The second important part of the document prologue is the document type declaration. Don't confuse this with the Document Type Definition (DTD): the document type declaration contains or points to a DTD! The DTD consists of markup declarations that provide a "grammar" for XML documents. The document type declaration can point to an external DTD, contain the markup declarations directly, or both. The DTD for a document consists of both subsets taken together. Here's an example of a document type declaration:
<!DOCTYPE book SYSTEM "docbookx.dtd">
This document type declaration has the name book and points to an external DTD named docbookx.dtd. It has no inline DTD.
If a document contains the full DTD and no external entities, it's called a stand-alone document and marked as such in the XML declaration:
<?xml version="1.0" standalone='yes'?>
This can be useful for some applications; for example, for delivery of documents over a network, when you want to open only a single document stream. Note that even XML documents with external DTDs can be converted to stand-alone documents by importing the DTD and external entities into the document prologue.
Document Structure
Now you know all the pieces that form an XML document: elements (with attributes), processing instructions, entities, comments, and CDATA sections. But how are these pieces grouped together to form a meaningful XML document?
The XML specification only defines a very generic document structure. It says that each well-formed document has these qualities (more about what "well-formed" means later):
-
May have a document prologue identifying the XML version and DTD.
-
Must have exactly one root element and an arbitrary number of elements below the root.
-
May have miscellaneous stuff after that.
The last part, "miscellaneous stuff," is referenced in a wry tone here; it's considered by many people to be a design error of XML. It makes parsing XML documents potentially much harder, because you can't rely on the document end being the closing root element. When parsing a document over a network connection, for example, you can't close the connection after having received the closing root element; you must wait until the server closes the connection on its own, as there may still be more "miscellaneous" content to consider.
But nothing was said yet about the syntax and structure of the thing that supposedly is responsible for the whole magic of XML: the Document Type Definition. Indeed, it's the DTD that gives meaning to an XML document; it defines its syntax, the sequence and nesting of tags, possible element attributes, entities; in short, the whole grammar. Writing complex DTDs is no easy task and whole books have been written to cover the subject. Because as an XML user you usually don't need to deal with this task directly, we won't cover this topic here. Instead, we'd like to look at another XML concept that may be more important in your daily work.
XML Namespaces
You've seen some different XML applications (Document Type Definitions) and what they're used for. But what if you want to create a single XML document containing elements from two different DTDs? For example, the <part> element could mean a book part in one DTD and a manufacturing part in another. Without a way to separate these two namespaces, the two element names would clash. How could these distinct elements be identified? You need to associate an identifier with the element, for example <part namespace = "book"> or, if you want to avoid attributes, <book:part> and <manufacturing:part>.
The W3C learned early about this shortcoming in XML and introduced a new specification: Namespaces in XML, published as a Recommendation on January 14, 1999.
XML namespaces provide a method for having multiple namespaces, identified by Uniform Resource Identifiers (URI), in one XML document. The Resource Description Framework DTD uses this method. Look at the following example from the RDF specification:
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:s="http://description.org/schema/">
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>
This defines two namespaces, one named rdf and one named s. After the definition, a namespace is referenced by prefixing it (concatenated with a colon) to an element name, thus effectively avoiding the collision of different logical meanings and syntactical definitions.
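To see how a namespace-aware parser resolves the prefixes, the sketch below runs the RDF example through xml_parser_create_ns(). The choice of ":" as separator and the $names array are assumptions for this illustration, and case folding is switched off so that names are not uppercased. The parser should then report each element as its namespace URI joined to the local name by the separator, not as the prefixed name that appears in the document.

```php
<?php
// Sketch only: separator and variable names are illustrative choices.
$xml = <<<'XML'
<?xml version="1.0"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:s="http://description.org/schema/">
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>
XML;

$names = [];

// Namespace-aware parser; URI and local name are joined by ":".
$parser = xml_parser_create_ns('UTF-8', ':');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$names) {
        $names[] = $name;          // record every opening element
    },
    function ($parser, $name) {
        // closing elements are ignored in this sketch
    }
);

if (!xml_parse($parser, $xml, true)) {
    die("XML error: " . xml_error_string(xml_get_error_code($parser)) . "\n");
}

foreach ($names as $name) {
    print $name . "\n";
}
```

Two documents that bind different prefixes to the same URI would produce identical names here, which is exactly why the URI and not the prefix identifies the namespace.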
Note: The URI in a namespace identifier is not a DTD. It would of course be nice to be able to point to different DTDs using XML namespaces, but there are currently many technical problems with this approach; this is being addressed by the W3C in the XML Schema definition, which is under development at the time of this writing.
EBNF, or "What the Heck Is That Again?"
As a Web developer, you'll frequently be faced with the task of reading specifications, whether project specs, formal language definitions, or standards whitepapers. When reading some of the specifications from the W3C (the most well-known are probably the HTML and XML documents), you'll stumble across a strange mixture of characters that presumably form a grammar definition.
document ::= prolog element Misc*
This is the very first syntax definition in the XML specification and defines the basic structure of an XML document. The notation used is called Extended Backus-Naur Form, or EBNF for short. Understanding the formal specifications will get a lot easier once you understand the basics of EBNF.
EBNF is a formal way to define the syntax of a programming language so that there's no ambiguity left as to what's valid or allowed. It's also used in many other standards, such as protocol or data formats and markup languages like XML and SGML. As EBNF makes for a very rigorous grammar definition, there are software tools available that automatically transform a set of EBNF rules into a parser. Programs that do this are called compiler compilers. The most famous of these is YACC (Yet Another Compiler Compiler), but there are of course many more.
You can see EBNF as a set of rules, called productions or production rules. Every rule describes a part of the overall syntax. You start with one start symbol (called S, by tradition) and then you define rules for what you can replace this symbol with. Gradually, this will form a complex language grammar composed by the set of strings you can produce when following these rules.
If you look at the example from above again, you see that this is an assignment; there's a symbol on the left, an assignment operator (which can also be written as :=), and a list of values on the right. You play the game by following the symbol definition down to the last occurrence; then on the right side of the assignment no symbols are given, but a final string called a terminal, which is an atomic value.
EBNF defines three operators, which will look familiar to you from regular expressions:
Operator | Meaning |
---|---|
? | Optional |
+ | Must occur one or more times |
* | Must occur zero or more times |
To define the grammar of a language that allows you to express floating-point numbers, this EBNF notation would be used:
S := SIGN? D+ (. D+)?
D := [0-9]
SIGN := "+"|"-"
The first line defines the start symbol, with the following sequence:
-
An optional sign, consisting either of + or -
-
One or more elements of the D production
-
Optionally, a dot, and again one or more elements of D production
Notice that EBNF allows operators to work on groups of symbols: (. D+)? means that this expression is optional.
The second line lists the finals (atoms) for the D production, the digits 0 to 9 in this case. The syntax used is the same as with regular expressions; a set is defined in a bracket expression. The third line defines the two possible signs. The pipe character (|) is used to denote alternatives: A|B means "A or B but not both."
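Because the three operators behave like their regular-expression counterparts, the floating-point grammar above can be transcribed almost symbol by symbol into a pattern. The following sketch does this with preg_match(); the function name is_float_literal() and the test strings are assumptions made up for the illustration.

```php
<?php
// Sketch only: is_float_literal() is a hypothetical helper name.
//
//   S    := SIGN? D+ (. D+)?
//   D    := [0-9]
//   SIGN := "+"|"-"
function is_float_literal($candidate)
{
    // [+-]?          SIGN?      optional sign
    // [0-9]+         D+         one or more digits
    // (\.[0-9]+)?    (. D+)?    optional dot followed by digits
    // \A and \z anchor the match to the whole string.
    return preg_match('/\A[+-]?[0-9]+(\.[0-9]+)?\z/', $candidate) === 1;
}

foreach (['42', '-3.14', '+0.5', '.5', '1.', 'abc'] as $candidate) {
    printf(
        "%-6s %s\n",
        $candidate,
        is_float_literal($candidate) ? 'valid' : 'invalid'
    );
}
```

The strings .5 and 1. are rejected because the grammar demands at least one digit on both sides of the dot.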
That's a very basic explanation of EBNF. The XML specification defines additional syntax; for example, validity constraints and well-formedness constraints. It's explained in the Notation section of the spec, so we won't go into details here. More information about EBNF can be found in any modern compiler book.
Validity and Well-Formedness
There are two types of compliant XML documents: valid documents and well-formed documents. Any XML document is well-formed if it matches XML's basic syntax guidelines:
-
It contains one root element and an arbitrary number of elements below that element.
-
Elements are properly nested.
-
Attributes appear only once per element and are enclosed by single or double quotes. They cannot contain direct or indirect entity references to external entities. Nor can they contain a literal opening angle bracket (<).
-
Entities must be declared before they're used, except for the standard entities.
-
Entities must not refer to themselves recursively.
For example, the following is a well-formed XML document:
<greeting>Hello world.</greeting>
But it's not a valid document. The XML specification defines it this way: An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it. This means that any valid XML document is also well-formed. A well-formed document may be invalid if it doesn't adhere to the syntax laid out in the associated DTD. An ill-formed document can never be valid. An ill-formed document is not an XML document: It contains fatal errors and XML parsers are instructed to stop processing at this point.
The distinction between valid and well-formed has two very important implications for XML. First, it brings along two classes of XML parsers: those that care about the validity of an XML document and those that don't; that is, validating and non-validating parsers. The XML specification lists ease of use for developers as a design goal, and indeed it's quite easy for any medium-level programmer to write a non-validating parser. Writing a validating parser is a different matter, though.
Second, the validity versus well-formedness concept divides XML applications into two categories. One range of applications treats XML as an extended data-storage format. Well-formed documents are used for data storage and display. For this task, a DTD is not necessary; a well-formed document is sufficient. You would achieve some level of code reuse with this approach; for example, you could reuse the code for parsing data and generating tags in later applications. But as soon as you want to exchange information as information (as opposed to treating it as pure data), you need to give the document a meaning and associate it with a DTD. In applications dealing with information processing and exchange, only valid documents are appropriate.
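Since a non-validating parser stops at the first fatal error, it can double as a simple well-formedness check. The sketch below wraps PHP's XML extension in a small helper; the function name check_well_formed() and the sample documents are assumptions for this illustration. It says nothing about validity, because no DTD is consulted.

```php
<?php
// Sketch only: check_well_formed() is a hypothetical helper name.
// Returns null for a well-formed document, or an error description.
function check_well_formed($xml)
{
    $parser = xml_parser_create();

    // No handlers are registered; we only care whether parsing succeeds.
    if (xml_parse($parser, $xml, true)) {
        return null;
    }

    return sprintf(
        "%s at line %d",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}

$documents = [
    '<greeting>Hello world.</greeting>',   // well-formed
    '<a><b></a></b>',                      // improperly nested elements
    '<a></a><b></b>',                      // more than one root element
];

foreach ($documents as $document) {
    $error = check_well_formed($document);
    print $document . ' => ' . ($error === null ? 'well-formed' : $error) . "\n";
}
```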
Now that you've learned about the basics of XML and related topics, let's put the gained knowledge into practice by looking at Expat, a non-validating parser built into PHP.
PHP and Expat
Expat is the parser that is responsible for XML processing in Mozilla, Apache, Perl, and many other projects. It has been possible to compile it into PHP since version 3.0.6, and it has been part of the official Apache distribution since Apache 1.3.9. Since Expat is a non-validating parser, it's fast and small, and well suited for Web applications.
Event-Based API
There are two types of XML parser APIs: tree-based parsers that usually provide an interface to the Document Object Model (more about this later) and those that process XML documents with an event-based approach. Expat makes an event-based API available.
Event-based parsers have a data-centric view of XML documents. They parse the document from top to bottom and report events (such as the start of an element, the end of an element, the start of character data, etc.) to the application, usually through callback functions. The "Hello World" example document from earlier in the chapter would be reported by an event-based parser as a series of these events:
-
Open Element: greeting
-
Open CDATA section, value: Hello World
-
Close Element: greeting
Unlike tree-based parsers, they don't create a structural representation of the document. This provides lower-level access and is much more efficient in terms of speed and resource usage. There's no need to hold the entire document in memory; indeed, documents can be much larger than your system's memory. Of course, it's still completely possible to create a native tree structure if you need to do so. Prior to parsing a document, event-based parsers generally require you to register callback functions that will get invoked when a certain event occurs. Expat is no exception. It defines six possible events plus one default handler:
Target | Function | Description |
---|---|---|
elements | xml_set_element_handler() | Opening and closing of elements |
character data | xml_set_character_data_handler() | Beginning of character data |
external entities | xml_set_external_entity_ref_handler() | Occurrence of an external entity |
unparsed external entities | xml_set_unparsed_entity_decl_handler() | Occurrence of an unparsed external entity |
processing instructions | xml_set_processing_instruction_handler() | Occurrence of a processing instruction |
notation declarations | xml_set_notation_decl_handler() | Occurrence of a notation declaration |
default | xml_set_default_handler() | All events that have no assigned handler |
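To make the event sequence concrete, here is a minimal sketch that registers an element handler and a character data handler and records every event the parser reports. It assumes a current PHP version, where handlers may be closures; the $events array is our own illustration and not part of the Expat API.

```php
<?php
$events = [];

$parser = xml_parser_create();
// Keep element names as written instead of folding them to uppercase.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// Start and end of elements
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$events) {
        $events[] = "Open Element: $name";
    },
    function ($p, $name) use (&$events) {
        $events[] = "Close Element: $name";
    }
);

// Character data between the tags
xml_set_character_data_handler(
    $parser,
    function ($p, $data) use (&$events) {
        $events[] = "Character data: $data";
    }
);

// The third argument marks this string as the final piece of the document.
$ok = xml_parse($parser, '<greeting>Hello World</greeting>', true);

print implode("\n", $events) . "\n";
```

Note that a parser is free to deliver character data in several pieces, so a robust handler should append to a buffer rather than assume one call per text run.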
Let's start with a really basic example. The source code in Listing 7.2 forms a program to extract all comments from an XML document (remember, comments have the form <!-- … -->). The example registers only one handler that gets called for all events during the parsing. If you register another handler, for example using xml_set_character_data_handler(), the default handler would not be invoked for this specific event: the default handler processes only "free" events with no assigned handler.
Listing 7.2. Extracting comments from an XML document.
    require("xml.php3");

    function default_handler($p, $data)
    {
        global $count; // count of comments found

        // Check if the current data contains a comment
        if (ereg("!--", $data, $matches)) {
            $line = xml_get_current_line_number($p);
            // Indent continuation lines with a tab
            $data = str_replace("\n", "\n\t", $data);
            // Output line number and comment
            print "$line:\t$data\n";
            // Increase count of comments found
            $count++;
        }
    }

    // Process the file passed as first argument to the script
    $file = $argv[1];
    $count = 0;

    // Create the XML parser
    $parser = xml_parser_create();

    // Set the default handler for all events
    xml_set_default_handler($parser, "default_handler");

    // Parse file and check the return code
    $ret = xml_parse_from_file($parser, $file);
    if(!$ret) {
        // Print error message and die
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)));
    }

    // Free the parser instance
    xml_parser_free($parser);
The example works in a pretty straightforward way. First, the XML parser instance is created using xml_parser_create(). In all subsequent functions, you'll use the parser identifier you created this way, in a similar fashion to the result identifier in the MySQL functions. Then the default handler is registered and the file is parsed. xml_parse_from_file() is a custom function we provide in a library; this function simply opens the file specified as the argument and parses it in blocks of 4KB. PHP's original XML functions xml_parse() and xml_parse_into_struct() operate on strings; by using wrappers for opening, reading, and closing a file and passing its contents to the respective functions, you can save time and code.
The default handler checks whether the current data section is a comment and outputs it if this is the case. Along with each comment, the current line number (returned by xml_get_current_line_number()) is also printed.
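A file wrapper of the kind described above could look roughly like the following sketch. This is not the library function from the book: the name parse_file_in_blocks(), its signature, and the temporary test file are assumptions made for illustration, and the code targets a current PHP version.

```php
<?php
// Hypothetical re-creation of a file wrapper around xml_parse();
// the function name and signature are ours, not part of PHP.
function parse_file_in_blocks($parser, string $file, int $blockSize = 4096): bool
{
    $fp = fopen($file, 'rb');
    if ($fp === false) {
        return false;
    }
    $ok = true;
    while ($ok && !feof($fp)) {
        $data = (string) fread($fp, $blockSize);
        // The third argument tells the parser whether this is the final block.
        $ok = xml_parse($parser, $data, feof($fp)) == 1;
    }
    fclose($fp);
    return $ok;
}

// Usage: write a document larger than one block to a temporary file
// and count the elements the parser reports.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($file, '<list>' . str_repeat('<item>x</item>', 1000) . '</list>');

$count = 0;
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$count) { $count++; },
    function ($p, $name) { /* nothing to do on close */ }
);

$parsed = parse_file_in_blocks($parser, $file);
unlink($file);
```

Because the parser keeps its state between calls, elements that straddle a block boundary are still reported exactly once.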
Now, while this example shows off the basic concepts of invoking the XML parser, registering callback functions, and processing data, it doesn't exactly demonstrate the common usage of an XML parser. It doesn't process information; raw data is just read in and scanned for a string, which is nothing that couldn't be done with traditional regular expressions. In most situations where you process XML, you'll want to keep at least a basic representation of the document structure.
Stacks, Depths, and Lists
Our second example illustrates how to remember the element depth the parser is currently processing. In the start-element handler the global $depth variable is increased by four; in the stop-element handler it's decreased by the same figure. This is the most reduced case of a parser stack: no structure other than depth information is kept. As an XML pretty printer, the example uses the depth to properly indent code. The handler functions simply apply a Cascading Style Sheet to the current data to produce nicely formatted output. The only other noteworthy part of the code is this line:
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
This disables case folding for the parser, telling it that the case of element names should be preserved. If this option is enabled, all element names are transformed to uppercase. Usually, you'll want to turn this off, as case is important for element names in XML.
We won't print the source code of the example here because of its simplicity; you can find it on the CD-ROM. Figure 7.4 shows a screen shot of the output.
Figure 7.4. Output of the XML pretty printer.
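The depth-tracking idea can be sketched in a few lines. This is not the pretty printer from the CD-ROM: it is a simplified illustration for a current PHP version that uses closures instead of global variables, prints plain text instead of styled HTML, and ignores character data.

```php
<?php
$depth = 0;   // current indentation in spaces
$lines = [];  // collected output lines

$start = function ($parser, $name, $attrs) use (&$depth, &$lines) {
    $lines[] = str_repeat(' ', $depth) . "<$name>";
    $depth += 4;   // children are indented four spaces further
};

$end = function ($parser, $name) use (&$depth, &$lines) {
    $depth -= 4;   // back to the level of the matching start tag
    $lines[] = str_repeat(' ', $depth) . "</$name>";
};

$parser = xml_parser_create();
// Preserve the case of element names
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($parser, $start, $end);
xml_parse($parser, '<book><title>PHP</title><para>Text</para></book>', true);

print implode("\n", $lines) . "\n";
```

The single $depth counter is the only state kept between events, which is exactly why this approach runs out of steam as soon as you need to know which element is the parent.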
Usually, this naive approach of maintaining just one depth variable is not enough. With event-based parsers, you'll usually end up using your own stacks or lists to maintain information about the document's structure. This is evidenced quite well by the next example, shown in Listing 7.3.
Listing 7.3. XMLStats: collecting statistical information about an XML document.
    require("xml.php3");

    // The first argument is the file to process
    $file = $argv[1];

    // Initialize variables
    $elements = $stack = array();
    $total_elements = $total_chars = 0;

    // The base class for an element
    class element
    {
        var $count = 0;
        var $chars = 0;
        var $parents = array();
        var $childs = array();
    }

    // Utility function to print a message in a box
    function print_box($title, $value)
    {
        printf("\n+%'-60s+\n", "");
        printf("|%20s", "$title:");
        printf("%14s", $value);
        printf("%26s|\n", "");
        printf("+%'-60s+\n", "");
    }

    // Utility function to print a line
    function print_line($title, $value)
    {
        printf("%20s", "$title:");
        printf("%15s\n", $value);
    }

    // Sort function for uasort()
    function my_sort($a, $b)
    {
        return(is_object($a) && is_object($b) ? $b->count - $a->count : 0);
    }

    function start_element($parser, $name, $attrs)
    {
        global $elements, $stack;

        // Does this element already exist in the global $elements array?
        if(!isset($elements[$name])) {
            // No - add a new instance of class element
            $element = new element;
            $elements[$name] = $element;
        }

        // Increase this element's count
        $elements[$name]->count++;

        // Is there a parent element?
        if(isset($stack[count($stack)-1])) {
            // Yes - set $last_element to the parent
            $last_element = $stack[count($stack)-1];

            // If there is no entry for the parent element in the current
            // element's parents array, initialize it to 0
            if(!isset($elements[$name]->parents[$last_element])) {
                $elements[$name]->parents[$last_element] = 0;
            }

            // Increase the count for this element's parent
            $elements[$name]->parents[$last_element]++;

            // If there is no entry for this element in the parent
            // element's childs array, initialize it to 0
            if(!isset($elements[$last_element]->childs[$name])) {
                $elements[$last_element]->childs[$name] = 0;
            }

            // Increase the count for this element in the parent's
            // childs array
            $elements[$last_element]->childs[$name]++;
        }

        // Add current element to the stack
        array_push($stack, $name);
    }

    function stop_element($parser, $name)
    {
        global $stack;

        // Remove last element from the stack
        array_pop($stack);
    }

    function char_data($parser, $data)
    {
        global $elements, $stack;

        // Increase character count for the current element
        $elements[$stack[count($stack)-1]]->chars += strlen(trim($data));
    }

    // Create Expat parser
    $parser = xml_parser_create();

    // Set handler functions
    xml_set_element_handler($parser, "start_element", "stop_element");
    xml_set_character_data_handler($parser, "char_data");
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    // Parse the file
    $ret = xml_parse_from_file($parser, $file);
    if(!$ret) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)));
    }

    // Free parser
    xml_parser_free($parser);

    // Free helper elements
    unset($elements["current_element"]);
    unset($elements["last_element"]);

    // Sort $elements array by element count
    uasort($elements, "my_sort");

    // Loop through all elements collected in $elements
    while(list($name, $element) = each($elements)) {
        print_box("Element name", $name);
        print_line("Element count", $element->count);
        print_line("Character count", $element->chars);

        printf("\n%20s\n", "* Parent elements");

        // Loop through the parents of this element, output them
        while(list($key, $value) = each($element->parents)) {
            print_line($key, $value);
        }
        if(count($element->parents) == 0) {
            printf("%35s\n", "[root element]");
        }

        // Loop through the childs of this element, output them
        printf("\n%20s\n", "* Child elements");
        while(list($key, $value) = each($element->childs)) {
            print_line($key, $value);
        }
        if(count($element->childs) == 0) {
            printf("%35s\n", "[no childs]");
        }

        $total_elements += $element->count;
        $total_chars += $element->chars;
    }

    // Final summary
    print_box("Total elements", $total_elements);
    print_box("Total characters", $total_chars);
This application uses Expat to collect statistical data about an XML document. For each element, it prints a bunch of information:
-
How many times it occurred within the document
-
How much character data was found within this element
-
All parent elements encountered
-
All child elements
To achieve this, the script needs at the very least to know the parent element for the current element. This is not possible using the normal XML parser: you only get events for the current element, and no contextual information is recorded. Thus we needed to set up our own stack structure. We could have used a FIFO stack (First In, First Out) with two elements, but to give you a better example of keeping element nesting information within a data structure, we voted for a FILO (First In, Last Out) stack. This stack, which is a normal array, holds all currently open elements. In the open-element handler, the current element is pushed on top of the stack using array_push(). Accordingly, the end-element handler function removes the top element with array_pop().
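Stripped of the statistics and the output code, the stack technique comes down to the following sketch. It assumes a current PHP version and uses closures rather than the global variables of Listing 7.3; the $pairs array and the "(none)" label for the root's missing parent are our own illustration.

```php
<?php
$stack = [];   // names of all currently open elements
$pairs = [];   // "parent > child" => number of occurrences

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$stack, &$pairs) {
        // The element on top of the stack is the parent of the one being opened.
        $parent = $stack ? $stack[count($stack) - 1] : '(none)';
        $key = "$parent > $name";
        $pairs[$key] = ($pairs[$key] ?? 0) + 1;
        // The new element becomes the innermost open element.
        array_push($stack, $name);
    },
    function ($p, $name) use (&$stack) {
        // The element is closed: remove it from the top of the stack.
        array_pop($stack);
    }
);

xml_parse(
    $parser,
    '<book><chapter><para/><para/></chapter><chapter><para/></chapter></book>',
    true
);

print_r($pairs);
```

After a well-formed document has been parsed completely, the stack is empty again, which makes for a cheap sanity check.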
A note on array_pop() and array_push(). These and many other useful functions dealing with arrays have been added only in PHP 4.0. We wanted to port them over to PHP 3.0, but it's difficult to implement them efficiently in native PHP (to backport it to PHP 3.0) because of the way unset() works. To pop an element off the stack, you would use a snippet like this:
unset($array[count($array) - 1]);
If this worked well, it would be trivial to implement array_pop(). However, it doesn't work well. With PHP, unset() leaves holes in the array; it doesn't reset the "index counter." You can easily verify this yourself:
    $array = array("a");
    unset($array[0]);
    $array[] = "a";
    var_dump($array);
The element a will now have the key 1, instead of the expected 0. This leads to fragmented arrays, which are unsuitable for a stack. The behavior has its reasons: If the hole was eliminated, every other element in the array would need to be reorganized, which would be undesirable in many situations. To work around this problem, we'd need an array_compact() function, which doesn't exist in PHP at the time of this writing. The only conclusion to draw is this: Use PHP 4.0. In the PHP 3.0 implementation of the example (see the CD-ROM), we had to use the $depth variable to keep track of the element nesting manually. This introduces another global variable and is not as elegant as array_pop() and array_push(), but it works.
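The following sketch contrasts the three behaviors on a current PHP version. It assumes array_values() as the reindexing step; that function is our substitute for the array_compact() wished for above, not something the original example uses.

```php
<?php
// unset() removes the element but keeps the internal index counter.
$array = ["a"];
unset($array[0]);
$array[] = "a";
$keysAfterUnset = array_keys($array);      // the new element gets key 1

// array_values() returns a copy with keys renumbered from 0.
$compacted = array_values($array);
$keysAfterCompact = array_keys($compacted);

// array_pop() removes the last element and lets the next append reuse its key.
$stack = ["a", "b"];
array_pop($stack);
$stack[] = "c";
$keysAfterPop = array_keys($stack);

var_dump($keysAfterUnset, $keysAfterCompact, $keysAfterPop);
```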
To collect information about each element, the script needs to remember all occurrences of each element. We use a global array variable, $elements, to hold all distinct elements of the document. The array entries are instances of the element class, which has four properties (class variables):
Property | Description |
---|---|
count | The number of times the element was found in the document. |
chars | Bytes of character data within this element. |
parents | Parent elements. |
childs | Child elements. |
As you see, it's no problem to keep class instances within an array.
Tip: A peculiar language feature of PHP is that you can traverse class structures just like you would traverse associative arrays, using the while(list() = each()) loop shown in Chapter 1, "Development Concepts." It will show you all class variables and method names as strings.
Each time an element is found, the count element in the corresponding elements array item is incremented. In the parent's entry (parent meaning the last opened element tag), the current element's name is appended to the childs array entry. The parent element is added to the array entry with the key parents. The rest of the code loops through the elements array and its subarrays to display the statistics. While this produces a nice output, the code per se is neither of particular elegance nor does it consist of clever tricks: It's a loop like you probably use every day simply to get the job done.
DOM: Document Object Model
The other main family of XML parsers are those that enable access to a Document Object Model (DOM) structure. As you've seen, with event-based parsers you often have to set up your own data structures. The DOM approach avoids that requirement by building its own structure in main memory. Rather than responding to specific events, you work with this structure to process the document. While event-based parsers read an XML document in small chunks, reducing parsing memory usage and increasing performance, DOM parsers need to create an in-memory representation of the whole document. This uses more memory; keep this in mind when working with large documents.
The DOM Level 1.0 was defined as a standard (W3C Recommendation) in October 1998 by the (by now probably well known) W3C organization. You may have heard of the DOM standard already in another context: The term is also commonly used to describe the object model of HTML pages that can be accessed with JavaScript. For example, to read the value of a form field, you could use the following JavaScript snippet:
fieldvalue = document.myform.myfield.value;
Notice the hierarchy expressed in the statement. document is the root element and myform denotes an HTML form, within which myfield is a text field. Indeed, the HTML DOM is an extension of the core Document Object Model defined by the W3C. The DOM core represents the functionality used for XML documents, and also serves as the basis for the HTML DOM. It's a collection of objects that you use to access and manipulate the data and markup stored in an XML document. It defines the following:
-
A set of objects for representing the complete structure of an XML document
-
A model of how these objects can be combined
-
An interface for accessing and manipulating these objects
By abstracting the document, the DOM exposes a tree, with parent and child nodes, and methods like getAttribute() for the nodes. In short, the DOM provides you with a standard, object-oriented, tree-like interface to XML documents.
The DOM specification is programming-language-independent. The specification recommends an object-oriented implementation, thus requiring a language with at least basic object-orientation features. It defines a set of node types (interfaces), which taken together form the complete document. Some types of nodes may have child nodes, others are leaf nodes that cannot have anything below them. We'll continue by describing these node types, as they're outlined in the original W3C specification. Please refer to the specification for a detailed description of all methods and attributes of each instance.
Document
The Document interface is the root node of the structure tree. This interface can contain only one element, which is the XML document's root element. It can also contain the document type declaration associated with this document (organized in a DocumentType interface), and, if available, processing instructions or comments from outside the root element.
Since the other nodes are all placed below the Document node, the Document interface contains a number of methods to create subnodes. Using these functions, it's possible to construct a complete XML document programmatically. The specification also defines a method getElementsByTagName() to retrieve all elements with a given tag name in the document.
DocumentFragment
A DocumentFragment node is a portion of a complete XML document. It's often necessary to rearrange parts of a document or to extract part of it; for this, a lightweight object is needed to hold the resulting fragment. For example, imagine you want to construct a single book file out of many different chapter files: each chapter could be read into the DocumentFragment object and inserted into the book's document structure. Without a way to organize fragments of documents, you'd have to add each element of each chapter one by one to the book document.
To make it even easier, the specification defines that when DocumentFragment is inserted into a node, only the children of the DocumentFragment and not the DocumentFragment itself are inserted into the node.
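The insertion rule can be observed with PHP's current DOM extension, as in the sketch below. It is an illustration under assumptions: the sample book is invented, and appendXML() is a PHP-specific convenience method rather than part of the W3C specification.

```php
<?php
$book = new DOMDocument();
$book->loadXML('<book><title>Sample Book</title></book>');

// A fragment is a lightweight container that belongs to the document
// but is not yet part of its tree.
$fragment = $book->createDocumentFragment();
// appendXML() is a PHP-specific shortcut, not defined by the W3C DOM.
$fragment->appendXML('<chapter>One</chapter><chapter>Two</chapter>');

// Only the fragment's children are inserted, not the fragment itself.
$book->documentElement->appendChild($fragment);

$names = [];
foreach ($book->documentElement->childNodes as $child) {
    $names[] = $child->nodeName;
}

print $book->saveXML($book->documentElement) . "\n";
```

The root element ends up with the title and the two chapters as direct children; no node representing the fragment appears in the tree.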
DocumentType
The DocumentType node holds the document type declaration of a document, if present. This interface is read-only; it cannot be altered through the DOM at this time.
Element
Each element in a document is represented by an Element node. To get the name of the element, the tagName property can be used. This interface also defines a series of functions to set and get element attributes, and to access sub-elements.
Attr
An Attr node represents an element attribute in an Element object. The name and value of the attribute can be read from the name and value properties of the interface. The specified property tells you whether the user specified a value for this Attr or the value is the default string specified in the DTD.
EntityReference
This node represents an entity reference found in the XML document. Note that character references (for example, &#60; for the character <) are expanded by the XML parser and are thus not made available as EntityReference nodes.
Entity
This node represents an entity, either parsed or unparsed.
ProcessingInstruction
The ProcessingInstruction node represents a processing instruction (PI) in a document. It has only two attributes, namely target (the PI target) and data (the contents).
Comment
This CharacterData interface represents the content of a comment, i.e. all the characters between <!-- and -->. It has no further attributes or methods.
Text
The Text CharacterData interface represents the character data (textual content) of an Element or Attr node. The Text interface has no attributes, and only one method, namely splitText(). This method splits one Text node into two, which can be useful for rearranging content.
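A short sketch of splitText(), using the DOM extension of a current PHP version (an assumption, since it is not the API this chapter documents); the sample paragraph is invented for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<para>Hello World</para>');

$para = $doc->documentElement;
$text = $para->firstChild;        // the single Text node inside <para>

// Split after the fifth character; the return value is the new second node.
$tail = $text->splitText(5);

printf("first: '%s', second: '%s', children: %d\n",
    $text->data, $tail->data, $para->childNodes->length);
```

After the call, the element has two adjacent Text nodes, and other nodes could now be inserted between them.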
CDATASection
The CDATASection interface inherits the Text interface (and with it the CharacterData interface) and holds the CDATA section.
Notation
This node represents a notation declared in the document type declaration.
Basic Interfaces
All these objects inherit the Node interface, which is the primary basic datatype for the DOM. It represents a single node in the document tree structure. The Node interface defines the attributes and methods you'll use most often when dealing with the DOM. To traverse a document, for example, you would use the childNodes attribute containing all children and the nextSibling attribute containing the next node on the same level. Methods like appendChild() and removeChild() can be used to alter the tree structure.
The only objects not directly derived from a Node interface are CDATASection, Text, and Comment. Text and Comment are derived from the CharacterData interface; CDATASection inherits Text. The CharacterData interface extends Node with a set of attributes and methods for accessing character data. For example, you can use substringData() to extract part of the character data.
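The Node and CharacterData members mentioned above can be tried out as follows. This is a minimal sketch against the DOM extension of a current PHP version, which we assume here in place of the LibXML functions covered later; the added note element is purely illustrative.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<book><title>Cutting-Edge Applications</title>'
    . '<para>Sample paragraph.</para></book>'
);
$root = $doc->documentElement;

// Walk the children of the root with firstChild and nextSibling.
$names = [];
for ($node = $root->firstChild; $node !== null; $node = $node->nextSibling) {
    $names[] = $node->nodeName;
}

// CharacterData: extract part of the title's text (offset 0, 7 characters).
$titleText = $root->firstChild->firstChild;
$excerpt = $titleText->substringData(0, 7);

// Alter the tree: append a new element, then remove the title.
$root->appendChild($doc->createElement('note', 'Added later'));
$root->removeChild($root->firstChild);

$namesAfter = [];
foreach ($root->childNodes as $child) {
    $namesAfter[] = $child->nodeName;
}
```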
Example: Analyzing a Short Document with the DOM
The easiest way to get an idea about the concrete implementation of the DOM is by seeing how a sample XML document would be handled by a DOM-compliant processor. Let's create a short book document:
    <?xml version="1.0"?>
    <!DOCTYPE book SYSTEM "docbookx.dtd">
    <book>
        <title>
            Cutting-Edge Applications
        </title>
        <para language="en">
            Sample paragraph.
        </para>
    </book>
A DOM representation of this document will be organized in a hierarchical structure like the one shown in Figure 7.5. In a DOM-compliant API, code could be similar to the following pseudocode:
    // Construct Document class instance
    $doc = new Document("file.xml");

    // Output the root element's name
    printf("Root element: %s<p>", $doc->documentElement->tagName);

    // Get all elements below the root node
    $node_list = $doc->getElementsByTagName("*");

    // Traverse the returned node list
    for($i=0; $i<$node_list->length; $i++) {
        // Fetch the node at this index
        $node = $node_list->item($i);

        // Output node name and value
        printf("Node name: %s<br>", $node->nodeName);
        printf("Node value: %s<br>", $node->nodeValue);
    }
Figure 7.5. DOM structure.
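For comparison, the pseudocode translates almost directly to the DOM extension shipped with current PHP versions. The sketch below assumes that extension; it loads the document from a string and omits the DOCTYPE line so that no external DTD file is needed.

```php
<?php
$xml = '<book><title>Cutting-Edge Applications</title>'
     . '<para language="en">Sample paragraph.</para></book>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// Name of the root element
$rootName = $doc->documentElement->tagName;

// "*" matches every element in the document, including the root.
$nodeList = $doc->getElementsByTagName('*');

$report = [];
for ($i = 0; $i < $nodeList->length; $i++) {
    $node = $nodeList->item($i);
    // textContent is the concatenated text of the node and its descendants.
    $report[] = $node->nodeName . ': ' . trim($node->textContent);
}

print "Root element: $rootName\n" . implode("\n", $report) . "\n";
```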
LibXML: A DOM-Based XML Parser
Since version 4.0, a new XML parser is built into PHP: LibXML. Daniel Veillard originally created this parser for the Gnome project to offer a DOM-ready parser for managing complex data exchange, and Uwe Steinmann integrated it into PHP.
While LibXML's internal document representation is very close to the DOM interfaces, it's misleading to call LibXML a DOM parser: Parsing and DOM usage really happen at different times in a document's life. It would be feasible to create an API above Expat to provide a DOM interface. The LibXML library makes this much easier, though; it's merely a matter of changing the API to match the DOM specification. Indeed, there is a GDome module in Gnome, which implements a DOM interface for LibXML.
Note: At the time of this writing, the LibXML API in PHP was being finalized. It was unstable and contained bugs; nonetheless, it already showed the tremendous benefits the finished LibXML API will offer. Therefore, we decided to document the basic principles here and provide some examples; if changes occur, we'll document them on the book's Web site.
Overview
Most developers will agree that an XML document is best represented in a tree structure. LibXML provides a nice API to construct trees and DOM-like data structures from an XML file. When you parse a document with LibXML, PHP constructs a set of classes, and you'll work with them directly. By invoking functions on these classes, you can access all levels of the structure and modify the document.
The two most important objects you'll spot when working with LibXML are document and node objects.
XML Documents
The abstract XML document is represented in a document object. Such objects are created by the functions xmldoc(), xmldocfile(), and new_xmldoc().
The function xmldoc() takes as its only argument a string containing an XML document. The xmldocfile() function behaves very similarly, but takes a filename as argument. To construct a new, blank XML document, you can use new_xmldoc().
All three functions return a document object, which has four associated methods and one class variable:
-
root()
-
add_root()
-
dtd()
-
dumpmem()
-
version
The function root() returns a node object containing the root element of the document. On empty documents as created by new_xmldoc(), you can add a root element using add_root(), which will return a node object as well. The function add_root() expects the name of the element as first argument when called as class method. You can also call it as global function, but then you need to pass a document class instance as first argument, and the name of the root element as second argument.
The dtd() function returns a DTD object with no methods, and the class variables name, sysid, and extid. The name of a DTD is always the name of the root element. The variable sysid contains the system identifier (for example, docbookx.dtd); the extid variable contains the external or public identifier. To convert the in-memory structure to a string, you can use the dumpmem() function. The version class variable contains the document's XML version, usually 1.0 today.
With these explanations, you're ready for a first, simple example. Let's construct a Hello World XML document with LibXML:
    $doc = new_xmldoc("1.0");
    $root = $doc->add_root("greeting");
    $root->content = "Hello World!";
    print(htmlspecialchars($doc->dumpmem()));
This will result in a well-formed XML document:
    <?xml version="1.0"?>
    <greeting>Hello World!</greeting>
The example also shows one property you don't know yet: content, which gives access to the contents of a node object.
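Readers on a current PHP version will not find new_xmldoc() and its companions. As a rough equivalent, and only as a sketch under that assumption, the same document can be built with the DOMDocument class of today's DOM extension:

```php
<?php
// Create an empty document with XML version 1.0.
$doc = new DOMDocument('1.0');

// Create the root element with its text content and attach it.
$root = $doc->createElement('greeting', 'Hello World!');
$doc->appendChild($root);

// Serialize the in-memory structure to a string.
$xml = $doc->saveXML();
print htmlspecialchars($xml);
```

Here createElement() plus appendChild() plays the role of add_root(), and saveXML() corresponds to dumpmem().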
Nodes
The Tao Te Ching says everything is Tao. In XML parsing, everything is a node. Elements, attributes, text, PIs, and so forth: from a programmer's point of view, you can treat them all in a very similar way, because they're nodes.
As we've already mentioned, nodes can be the most basic, atomic structure in an XML document. A node object has the following associated functions and variables:
-
parent()
-
children()
-
new_child()
-
getattr()
-
setattr()
-
attributes()
-
type
-
name
-
if available, content
With these functions and properties, you can get all available information about a node. You can access its attributes, child nodes (if any), and parent node. And you can modify the tree by adding children or setting attributes. Listing 7.4 shows the functions in action. This is the XML pretty printer mentioned earlier in the Expat section, ported to LibXML; instead of registering handler functions, it applies different formatting according to the node's type. Each node has an associated type. The type identifier is a PHP constant, and you can see the complete list in the example's source. Using the children() function, which returns the node's child elements (as node objects), it's easy to loop through the document. The example performs the loop recursively by calling the output_node() function again.
Listing 7.4. XML pretty printer: example using the LibXML functions.
    // Define tab width
    define("INDENT", 4);

    function output_node($node, $depth)
    {
        // Different action per node type
        switch($node->type) {
            case XML_ELEMENT_NODE:
                for($i=0; $i<$depth; $i++) print("&nbsp;");

                // Print start element
                print("<span class='element'>&lt;");
                print($node->name);

                // Get attribute names and values
                $attribs = $node->attributes();
                if(is_array($attribs)) {
                    while(list($key, $value) = each($attribs)) {
                        print(" $key = <span class='attribute'>$value</span>");
                    }
                }
                print("&gt;</span><br>");

                // Process children, if any
                $children = $node->children();
                for($i=0; $i < count($children); $i++) {
                    output_node($children[$i], $depth+INDENT);
                }

                // Print end element
                for($i=0; $i<$depth; $i++) print("&nbsp;");
                print("<span class='element'>&lt;/");
                print($node->name);
                print("&gt;</span><br>");
                break;

            case XML_PI_NODE:
                for($i=0; $i<$depth; $i++) print("&nbsp;");
                printf("<span class='pi'>&lt;?%s %s?&gt;</span><br>",
                    $node->name, $node->content);
                break;

            case XML_COMMENT_NODE:
                for($i=0; $i<$depth; $i++) print("&nbsp;");
                print("<span class='element'>&lt;!-- </span>");
                print($node->content);
                print("<span class='element'> --&gt;</span><br>");
                break;

            case XML_TEXT_NODE:
            case XML_ENTITY_REF_NODE:
            case XML_ENTITY_NODE:
            case XML_DOCUMENT_NODE:
            case XML_DOCUMENT_TYPE_NODE:
            case XML_DOCUMENT_FRAG_NODE:
            case XML_CDATA_SECTION_NODE:
            case XML_NOTATION_NODE:
            case XML_GLOBAL_NAMESPACE:
            case XML_LOCAL_NAMESPACE:
            default:
                for($i=0; $i<$depth; $i++) print("&nbsp;");
                printf("%s<br>", isset($node->content) ? $node->content : "");
        }
    }

    // Output stylesheet
    ?>
    <style type="text/css">
    <!--
    .xml { font-family: "Courier New", Courier, mono; font-size: 10pt; color: #000000}
    .element { color: #0033CC}
    .attribute { color: #000099}
    .pi { color: #990066}
    -->
    </style>
    <span class="xml">
    <?
    // The file to process
    $file = "test.xml";

    // Initial indenting
    $depth = 0;

    // Check if file exists
    if(!file_exists($file)) {
        die("Can't find file \"$file\".");
    }

    // Create xmldoc object from file
    $doc = xmldocfile($file) or die("XML error while parsing file \"$file\"");

    // Access root node
    $root = $doc->root();

    // Start traversal
    output_node($root, $depth);

    // End stylesheet span
    print("</span>");
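The same recursive, type-driven traversal can be sketched with the DOM extension of a current PHP version. This is an assumption-laden illustration rather than a port of Listing 7.4: the function outline_node() is ours, it emits plain text lines instead of styled HTML, and it handles only elements, comments, and text.

```php
<?php
// Hypothetical helper: append an indented outline of $node to $lines.
function outline_node(DOMNode $node, int $depth, array &$lines): void
{
    $indent = str_repeat(' ', $depth);

    switch ($node->nodeType) {
        case XML_ELEMENT_NODE:
            // Collect attribute names and values
            $attrs = '';
            foreach ($node->attributes as $attr) {
                $attrs .= " {$attr->name}=\"{$attr->value}\"";
            }
            $lines[] = "$indent<{$node->nodeName}$attrs>";

            // Recurse into the children, one level deeper
            foreach ($node->childNodes as $child) {
                outline_node($child, $depth + 4, $lines);
            }
            $lines[] = "$indent</{$node->nodeName}>";
            break;

        case XML_COMMENT_NODE:
            $lines[] = "$indent<!--{$node->nodeValue}-->";
            break;

        case XML_TEXT_NODE:
            // Skip text nodes that contain only whitespace
            if (trim($node->nodeValue) !== '') {
                $lines[] = $indent . trim($node->nodeValue);
            }
            break;
    }
}

$doc = new DOMDocument();
$doc->loadXML('<book><!-- demo --><para language="en">Sample paragraph.</para></book>');

$lines = [];
outline_node($doc->documentElement, 0, $lines);
print implode("\n", $lines) . "\n";
```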
One of the great advantages of LibXML over Expat is that you can also use it to construct XML documents. This avoids messing around with custom XML creation routines and frees you from tasks like remembering the nesting level to properly close tags. Listing 7.5 takes our earlier Hello World example a step further and constructs a complete RSS document (RSS stands for Rich Site Summary, an XML format to provide content information for Web sites). It uses setattr() to add attributes to an element and new_child() to add elements to a node. Have you noted the way new_child() is used? The function returns a node object, and you can simply discard that return value if you don't need it; you only need to assign it to a variable if you want to add child elements to the node you've just created.
Listing 7.5. Using LibXML routines to construct XML documents.
    $doc = new_xmldoc("1.0");
    $root = $doc->add_root("rss");
    $root->setattr("version", "0.91");

    $channel = $root->new_child("channel", "");
    $channel->new_child("title", "XML News and Features from XML.com");
    $channel->new_child("description", "XML.com features a rich mix of information and services for the XML community.");
    $channel->new_child("language", "en-us");
    $channel->new_child("link", "http://xml.com/pub");
    $channel->new_child("copyright", "Copyright 1999, O'Reilly and Associates and Seybold Publications");
    $channel->new_child("managingEditor", "dale@xml.com (Dale Dougherty)");
    $channel->new_child("webMaster", "peter@xml.com (Peter Wiggin)");

    $image = $channel->new_child("image", "");
    $image->new_child("title", "XML News and Features from XML.com");
    $image->new_child("url", "http://xml.com/universal/images/xml_tiny.gif");
    $image->new_child("link", "http://xml.com/pub");
    $image->new_child("width", "88");
    $image->new_child("height", "31");

    print(htmlspecialchars($doc->dumpmem()));
XML Trees
The methods outlined above construct separate objects for the document and for each node. While this is great for looping through the document as shown in the XML pretty printer, accessing single elements tends to get a bit cumbersome. Do you remember our sample Hello World document from earlier in the chapter?
    <?xml version="1.0"?>
    <greeting>Hello World!</greeting>
To access the contents of the root element, you'd have to use the following code:
    // The file to process
    $file = "test.xml";

    // Create xmldoc object from file
    $doc = xmldocfile($file) or die("XML error while parsing file \"$file\"");

    // Access root node
    $root = $doc->root();

    // Access root's children
    $children = $root->children();

    // Print first child's content
    print($children[0]->content);
And that's for a depth of one; imagine how you'd have to continue with deeper nested elements. If you think that this is a bit too much work, we agree. Fortunately, Uwe Steinmann agrees too, and has provided a more elegant method of random access to document elements: xmltree(). This function creates a structure of PHP objects, representing the whole XML document. When you pass it a string containing an XML document as first argument, the function returns a document object. The object is a bit different from the one described earlier, though: It doesn't allow functions to be called; instead, the same information is available as properties. Instead of getting a list of child elements with a children() call, the children are already present in the structure (in the children class variable), making it easy to access elements at every depth. Accessing the contents of the greeting element would therefore be done with the following call:
    // Create the document tree from the file's contents
    $doc = xmltree(join("", file($file))) or die("XML error while parsing file \"$file\"");
    print($doc->root->children[0]->content);
That looks infinitely better now. When you dump the structure returned by xmltree() with var_dump(), you get the following output:
    object(Dom document)(2) {
      ["version"]=>
      string(3) "1.0"
      ["root"]=>
      object(Dom node)(3) {
        ["type"]=>
        int(1)
        ["name"]=>
        string(8) "greeting"
        ["children"]=>
        array(1) {
          [0]=>
          object(Dom node)(3) {
            ["name"]=>
            string(4) "text"
            ["content"]=>
            string(12) "Hello World!"
            ["type"]=>
            int(3)
          }
        }
      }
    }
You see that this is one large structure, with the whole document ready in place. The actual parts of the structure are still document or node objects; indeed, internally the same class definitions are used. In contrast to objects created with xmldoc() and friends, though, you can't invoke functions on these structures. Consequently, the structure returned by xmltree() is read-only at this time; to construct XML documents, you need to use the other methods.