20.3 Reading and Writing XML
At the core of the .NET Framework XML classes are two abstract classes: XmlReader and XmlWriter. These are found in the System.Xml namespace.
XmlReader provides a fast, forward-only, read-only cursor for processing an XML document stream. XmlWriter provides an interface for producing XML document streams. Both classes imply a streaming model that doesn't require an expensive in-memory cache. This makes them both attractive alternatives to the classic DOM approach. Both XmlReader and XmlWriter are abstract base classes and define the functionality that all derived classes must support.
There are three concrete implementations of XmlReader:
XmlTextReader. This class provides forward-only, read-only access to a text-based stream of XML data. The reader is advanced using any of the read methods, and properties reflect the value of the current node (the node on which the reader is positioned). This class does not provide data validation. It's ideal for fast parsing of XML text that is well formed. It uses the DTD for checking whether the XML is well formed but does not validate using the DTD.
XmlNodeReader. This class reads an XML DOM subtree. It does not support DTD or schema validation.
XmlValidatingReader. This class represents a reader that provides DTD, XML-Data Reduced (XDR) schema, and XML Schema Definition Language (XSD) schema validation.
The XmlTextWriter class is a writer that provides a fast, noncached, forward-only way of generating streams or files containing XML data.
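As a brief illustration of the XmlNodeReader described above, the following sketch loads a small fragment into a DOM and then reads only the Name subtree through the same Read loop used by the other readers. The class name NodeReaderSketch, the method ElementNames, and the inline XML string are hypothetical and exist only for this example.

```csharp
using System;
using System.Text;
using System.Xml;

public class NodeReaderSketch
{
    // Collects the names of the elements found in the first <Name> subtree.
    public static string ElementNames(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        // Pick the DOM node whose subtree we want to stream over.
        XmlNode subtree = doc.GetElementsByTagName("Name")[0];

        XmlNodeReader reader = new XmlNodeReader(subtree);
        StringBuilder names = new StringBuilder();
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element)
            {
                if (names.Length > 0) names.Append(",");
                names.Append(reader.Name);
            }
        }
        reader.Close();
        return names.ToString();
    }

    public static void Main(string[] args)
    {
        string xml = "<Person id=\"1\"><Name><FirstName>Joe</FirstName>"
                   + "<LastName>Suits</LastName></Name></Person>";
        // Prints: Name,FirstName,LastName
        Console.WriteLine(ElementNames(xml));
    }
}
```

The reader sees only the nodes under the chosen DOM node, so the enclosing Person element never appears in the result.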
For the examples in this chapter we assume that our XML document is stored in a file called people.xml, whose contents are shown in Listing 20.1.
Listing 20.1 The Sample XML Document (XML)
<People>
    <Person id="1" ssn="555121212">
        <Name>
            <FirstName>Joe</FirstName>
            <LastName>Suits</LastName>
        </Name>
        <Address>
            <Street>1800 Success Way</Street>
            <City>Redmond</City>
            <State>WA</State>
            <ZipCode>98052</ZipCode>
        </Address>
        <Job>
            <Title>CEO</Title>
            <Description>Wears the nice suit</Description>
        </Job>
    </Person>

    <Person id="2" ssn="666131313">
        <Name>
            <FirstName>Linda</FirstName>
            <LastName>Sue</LastName>
        </Name>
        <Address>
            <Street>1302 American St.</Street>
            <City>Paso Robles</City>
            <State>CA</State>
            <ZipCode>93447</ZipCode>
        </Address>
        <Job>
            <Title>Attorney</Title>
            <Description>Stands up for justice</Description>
        </Job>
    </Person>

    <Person id="3" ssn="777141414">
        <Name>
            <FirstName>Jeremy</FirstName>
            <LastName>Boards</LastName>
        </Name>
        <Address>
            <Street>34 Palm Avenue</Street>
            <City>Waikiki</City>
            <State>HI</State>
            <ZipCode>98052</ZipCode>
        </Address>
        <Job>
            <Title>Pro Surfer</Title>
            <Description>Rides the big waves</Description>
        </Job>
    </Person>

    <Person id="4" ssn="888151515">
        <Name>
            <FirstName>Joan</FirstName>
            <LastName>Page</LastName>
        </Name>
        <Address>
            <Street>700 Webmaster Road</Street>
            <City>Redmond</City>
            <State>WA</State>
            <ZipCode>98073</ZipCode>
        </Address>
        <Job>
            <Title>Web Site Developer</Title>
            <Description>Writes the pretty pages</Description>
        </Job>
    </Person>
</People>
We will use the XML in Listing 20.1 for the rest of the samples.
20.3.1 Reading XML in .NET
Listing 20.2 shows the usage of XmlTextReader.
Listing 20.2 Browsing the XML File Using XmlTextReader (C#)
using System;
using System.Xml;

public class Test
{
    public static void Main(string[] args)
    {
        XmlTextReader reader = new XmlTextReader("c:\\people.xml");
        reader.WhitespaceHandling = WhitespaceHandling.None;

        //Moves the reader to the root element.
        reader.MoveToContent();

        Console.WriteLine("Name,Value,Type ");
        while (reader.Read())
        {
            Console.Write(reader.Name+",");
            Console.Write(reader.Value+",");
            Console.Write(reader.NodeType);
            Console.WriteLine();
        }
        reader.Close();
    }
}
Here is a partial output of Listing 20.2:
Name,Value,Type
People,,Element
Person,,Element
Name,,Element
FirstName,,Element
,Joe,Text
FirstName,,EndElement
LastName,,Element
,Suits,Text
LastName,,EndElement
Name,,EndElement
Address,,Element
Street,,Element
,1800 Success Way,Text
Street,,EndElement
City,,Element
,Redmond,Text
City,,EndElement
We first initialize an XmlTextReader by passing it the URL of the XML file:
XmlTextReader reader = new XmlTextReader("c:\\people.xml");
Next, we tell the XmlTextReader to ignore white space nodes when parsing the XML:
reader.WhitespaceHandling = WhitespaceHandling.None;
Then we position the reader on the root content node. In XML, nodes of the following types are considered noncontent nodes: ProcessingInstruction, DocumentType, Comment, Whitespace, and SignificantWhitespace. The following code skips any leading noncontent nodes at the beginning of the file and positions the reader on the first content node, <People>:
//Moves the reader to the root element. reader.MoveToContent();
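To make the skipping behavior concrete, here is a minimal sketch that feeds the reader an in-memory string in which a processing instruction and a comment precede the root element. Reading from a System.IO.StringReader instead of a file, and the names MoveToContentSketch and FirstContentName, are assumptions made only for this illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public class MoveToContentSketch
{
    // Reports the type and name of the first content node in the XML text.
    public static string FirstContentName(string xml)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));

        // Skips the leading noncontent nodes and stops on the first content node.
        XmlNodeType type = reader.MoveToContent();
        string result = type + ":" + reader.Name;

        reader.Close();
        return result;
    }

    public static void Main(string[] args)
    {
        string xml = "<?app hint?><!-- staff list -->\n"
                   + "<People><Person/></People>";
        // Prints: Element:People
        Console.WriteLine(FirstContentName(xml));
    }
}
```

MoveToContent returns the type of the node it stops on, which is why the sketch can report both the type and the name without a separate call to Read.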
When an XmlTextReader is first initialized there is no current node, so the first call to Read moves to the first node in the document. When an XmlReader reaches the end of the document, it doesn't walk off the end and leave the reader in an indeterminate state; instead, Read simply returns false when there are no more nodes to process.
Calling reader.Read() in a while loop moves the cursor forward through all the nodes. Each call to Read advances the cursor to the next node, so the code inside the while loop applies only to the node currently being visited.
So far, we have seen how to browse the file. Now suppose we want to know all the job titles in the people.xml file. Listing 20.3 shows how to gather that information.
Listing 20.3 Fetching Values of All Text Nodes Named "Title" (C#)
using System;
using System.Xml;

public class Test
{
    public static void Main(string[] args)
    {
        XmlTextReader reader = new XmlTextReader("c:\\people.xml");

        //Moves the reader to the root element.
        reader.MoveToContent();

        while (reader.Read())
        {
            if (reader.Name.Equals("Title") &&
                reader.NodeType == XmlNodeType.Element)
            {
                Console.WriteLine(reader.ReadElementString("Title"));
            }
        }
        reader.Close();
    }
}
The output of Listing 20.3 is as follows:
CEO
Attorney
Pro Surfer
Web Site Developer
The <Title> node is a text-only element, and so we use the method reader.ReadElementString. This is a helper method that returns the text value of that node. It is important to call this method for the right node. Note that we first check that the current node name is "Title" and that it is of the type Element. After we confirm that the current node being visited is <Title> and not </Title>, we call ReadElementString.
The XmlReader class provides a forward-only cursor, and hence all the nodes are sequentially visited and checked to see whether they meet the criteria. Although this approach works, it is inefficient for any modest data mining work. The XPath API (discussed in Section 20.4.1) gives more control of the search process and should be used for XML searching.
Read() doesn't encounter attribute nodes because they aren't considered part of a document's hierarchical structure. Attributes are typically considered metadata attached to structural elements. When the current node is an element, its attributes can be accessed through calls to GetAttribute by name or index. Listing 20.4 shows how to access attributes of a node.
Listing 20.4 Displaying the Attributes of a Node (C#)
using System;
using System.Xml;

public class Test
{
    public static void Main(string[] args)
    {
        XmlTextReader reader = new XmlTextReader("c:\\people.xml");

        //Moves the reader to the root element.
        reader.MoveToContent();

        while (reader.Read())
        {
            if (reader.HasAttributes)
            {
                Console.WriteLine("Id = "+reader.GetAttribute("id")+
                    " SSN ="+reader.GetAttribute("ssn"));
            }
        }
        reader.Close();
    }
}
The output of Listing 20.4 is as follows:
Id = 1 SSN =555121212
Id = 2 SSN =666131313
Id = 3 SSN =777141414
Id = 4 SSN =888151515
Not all the nodes in the XML file have attributes, and therefore we use the HasAttributes property of the reader to check whether the current node has attributes defined.
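Attributes can also be reached by position rather than by name. The sketch below, which assumes an in-memory fragment and the hypothetical names AttributeIndexSketch and DescribeAttributes, walks the attributes of the current element using AttributeCount, GetAttribute(int), and MoveToAttribute(int).

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public class AttributeIndexSketch
{
    // Lists name=value pairs for every attribute of the root element.
    public static string DescribeAttributes(string xml)
    {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        reader.MoveToContent();

        StringBuilder pairs = new StringBuilder();
        for (int i = 0; i < reader.AttributeCount; i++)
        {
            // Fetch the value by index while still positioned on the element.
            string value = reader.GetAttribute(i);

            // Step onto the attribute itself to learn its name.
            reader.MoveToAttribute(i);
            string name = reader.Name;

            // Return to the owning element before the next iteration.
            reader.MoveToElement();

            if (pairs.Length > 0) pairs.Append(";");
            pairs.Append(name).Append("=").Append(value);
        }
        reader.Close();
        return pairs.ToString();
    }

    public static void Main(string[] args)
    {
        // Prints: id=1;ssn=555121212
        Console.WriteLine(
            DescribeAttributes("<Person id=\"1\" ssn=\"555121212\"/>"));
    }
}
```

Access by index is handy when the attribute names are not known in advance; when they are known, GetAttribute by name, as in Listing 20.4, is simpler.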
An XML document contains data just as an RDBMS table contains rows, but RDBMSs provide SQL for accessing that data. To efficiently query and search for data in an XML document, you should use the XPath API (discussed later). Although XmlReader provides an API to visit all nodes, it is inefficient to use it by itself for any kind of modest data mining effort on an XML document.
Unlike the SAX event model (which fires off events upon encountering tags and requires you to have ContentHandler classes with specialized method signatures), the .NET API allows you to conveniently scan an entire XML document inside a single while loop. Although you can parse the entire document inside this loop, it is advisable to break the document into manageable chunks and hand off those chunks to classes or methods that can parse them. This technique is especially useful when the XML data maps to a code element (such as a class).
For example, suppose we're using people.xml to create Person objects. Listing 20.5 shows how the XML can be parsed to create meaningful data.
Listing 20.5 Parsing XML to Create Objects (C#)
using System;
using System.IO;
using System.Xml;
using System.Collections;

public class PersonReader
{
    private XmlTextReader reader;

    public PersonReader (string url)
    {
        reader = new XmlTextReader (url);
        reader.MoveToContent();
    }

    public bool NextPerson()
    {
        return reader.Read();
    }

    public Person GetPerson()
    {
        return new Person(reader);
    }

    public static void Main(string[] args)
    {
        PersonReader reader = new PersonReader("c:\\people.xml");
        while (reader.NextPerson())
        {
            Person p = reader.GetPerson();
            if (p != null)
            {
                Console.WriteLine(p.GetZip());
            }
        }
    }
}

public class Person
{
    Hashtable attributes;

    public Person (XmlTextReader reader)
    {
        attributes = new Hashtable();
        attributes.Add("id", null);
        attributes.Add("ssn", null);
        attributes.Add("FirstName", null);
        attributes.Add("LastName", null);
        attributes.Add("City", null);
        attributes.Add("State", null);
        attributes.Add("Street", null);
        attributes.Add("ZipCode", null);
        attributes.Add("Title", null);
        attributes.Add("Description", null);

        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.EndElement &&
                reader.Name.Equals("Person"))
                break;
            if (reader.HasAttributes)
            {
                attributes["id"] = reader.GetAttribute("id");
                attributes["ssn"] = reader.GetAttribute("ssn");
            }
            string name = reader.Name;
            if (attributes.ContainsKey(name))
            {
                attributes[name] = reader.ReadElementString(name);
            }
        }
    }

    public object GetID() { return attributes["id"]; }
    public object GetFirstName() { return attributes["FirstName"]; }
    public object GetLastName() { return attributes["LastName"]; }
    public object GetSSN() { return attributes["ssn"]; }
    public object GetCity() { return attributes["City"]; }
    public object GetState() { return attributes["State"]; }
    public object GetStreet() { return attributes["Street"]; }
    public object GetZip() { return attributes["ZipCode"]; }
    public object GetTitle() { return attributes["Title"]; }
    public object GetDescription() { return attributes["Description"]; }
}
The output of Listing 20.5 is as follows:
98052
93447
98052
98073
We start by creating a PersonReader, which wraps XmlTextReader. Then NextPerson and GetPerson encapsulate the navigation over the XmlTextReader. The Person object contains a Hashtable of attributes. The constructor of the Person object merely iterates through the reader object, retrieves the tag values it is interested in, and sets its state in the attributes Hashtable. After the state of the Person object is initialized, the accessor methods (getter methods) can access the state of the Person object.
So far, the XML parsed has been well formed but not necessarily valid. XmlValidatingReader supports validation against DTDs, XML-Data Reduced (XDR) schemas, and XSD schemas. Consumers can use the ValidationType property to control how the reader performs validation. In its default state (ValidationType.Auto), the reader auto-detects DTDs and schemas, which it also uses to process entities and default attribute values. To be notified of validation errors, you provide a ValidationEventHandler, an event handler that the reader calls when it encounters a validation error. The callback method is associated with the reader through the reader's ValidationEventHandler event.
The DTD for people.xml is stored in a file called people.dtd:
<!ELEMENT People (Person+)>
<!ELEMENT Person (Name,Address*,Job*)>
<!ELEMENT Name (FirstName,LastName)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
<!ELEMENT Address (Street,ZipCode,City,State)>
<!ELEMENT Street (#PCDATA)>
<!ELEMENT ZipCode (#PCDATA)>
<!ELEMENT City (#PCDATA)>
<!ELEMENT State (#PCDATA)>
<!ELEMENT Job (Title,Description)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
Note that we have intentionally omitted the definition of the attributes for the Person node.
Next, we add the following line to the top of the people.xml file:
<!DOCTYPE people SYSTEM "people.dtd">
Listing 20.6 shows how to use a validating XML reader.
Listing 20.6 Validating XML against a DTD (C#)
using System;
using System.Xml;
using System.Xml.Schema;

public class Test
{
    public static void Main(string[] args)
    {
        XmlValidatingReader reader = new XmlValidatingReader (
            new XmlTextReader("c:\\people.xml"));
        reader.ValidationType = ValidationType.DTD;
        reader.ValidationEventHandler +=
            new ValidationEventHandler (CallBack);

        //Moves the reader to the root element.
        reader.MoveToContent();

        while (reader.Read())
        {
        }
        reader.Close();
    }

    public static void CallBack (Object obj, ValidationEventArgs args)
    {
        Console.WriteLine(args.Message);
    }
}
The output of Listing 20.6 is as follows:
The root element name must match the DocType name. An error occurred at file:///c:/people.xml(2, 2).
The 'id' attribute is not declared. An error occurred at file:///c:/people.xml(3, 13).
The 'ssn' attribute is not declared. An error occurred at file:///c:/people.xml(3, 20).
Element 'Address' has invalid child element 'City'. Expected 'ZipCode'. An error occurred at file:///c:/people.xml(10, 14).
The 'id' attribute is not declared. An error occurred at file:///c:/people.xml(20, 13).
The 'ssn' attribute is not declared. An error occurred at file:///c:/people.xml(20, 20).
Element 'Address' has invalid child element 'City'. Expected 'ZipCode'. An error occurred at file:///c:/people.xml(27, 14).
The 'id' attribute is not declared. An error occurred at file:///c:/people.xml(37, 13).
The 'ssn' attribute is not declared. An error occurred at file:///c:/people.xml(37, 20).
Element 'Address' has invalid child element 'City'. Expected 'ZipCode'. An error occurred at file:///c:/people.xml(44, 14).
The 'id' attribute is not declared. An error occurred at file:///c:/people.xml(54, 13).
The 'ssn' attribute is not declared. An error occurred at file:///c:/people.xml(54, 20).
Element 'Address' has invalid child element 'City'. Expected 'ZipCode'. An error occurred at file:///c:/people.xml(61, 14).
Listing 20.6 starts by creating a validating reader:
XmlValidatingReader reader = new XmlValidatingReader (new XmlTextReader("c:\\people.xml"));
We then set the target against which the XML is to be validated. This can be an XSD schema or a DTD. Because we are validating against the DTD, we specify that as follows:
reader.ValidationType = ValidationType.DTD;
Next, we specify the callback handler that will be called during the validation process. Note that ValidationEventHandler is an event of the reader object; we subscribe to it with a delegate of the same name that wraps the CallBack method.
reader.ValidationEventHandler += new ValidationEventHandler (CallBack);
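We get errors during parsing of the XML, as shown in the output. The first error arises because the name in the DOCTYPE declaration (people) does not match the root element (People); XML names are case sensitive. The attribute errors arise because we omitted the attribute definitions from the DTD, and the Address errors arise because the DTD declares ZipCode before City, whereas the document lists City first. These errors are printed thanks to the following line in the CallBack method: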
Console.WriteLine(args.Message);
20.3.2 Writing XML in .NET
So far, we have seen how to read XML from a data stream; now we will explore writing XML to a file or a memory stream. XmlTextWriter is a concrete class, derived from XmlWriter, for writing character streams. It supports many different output stream types (file URI, stream, and TextWriter) and is configurable. You can specify things such as whether to provide namespace support, indentation options, the quote character to be used for attribute values, and even the lexical representation to be used for typed values.
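The configuration options mentioned above can be sketched as follows. The example writes to a System.IO.StringWriter so that the result can be inspected as a string, and sets the indentation and quote character explicitly. The names WriterOptionsSketch and WritePerson, and the particular settings chosen, are assumptions made for illustration rather than recommended values.

```csharp
using System;
using System.IO;
using System.Xml;

public class WriterOptionsSketch
{
    // Writes a small <Person> fragment with custom formatting settings.
    public static string WritePerson(string id, string title)
    {
        StringWriter buffer = new StringWriter();
        XmlTextWriter w = new XmlTextWriter(buffer);

        w.Formatting = Formatting.Indented; // put child elements on new lines
        w.Indentation = 2;                  // two indent characters per level
        w.IndentChar = ' ';                 // indent with spaces
        w.QuoteChar = '\'';                 // single quotes around attribute values

        w.WriteStartElement("Person");
        w.WriteAttributeString("id", id);
        w.WriteElementString("Title", title);
        w.WriteEndElement();

        // Push any pending output into the buffer before reading it back.
        w.Flush();
        string xml = buffer.ToString();
        w.Close();
        return xml;
    }

    public static void Main(string[] args)
    {
        Console.WriteLine(WritePerson("3", "Pro Surfer"));
    }
}
```

Because the writer targets a TextWriter, the same method could write to Console.Out or to a file-backed writer without other changes.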
The Person class shown in Listing 20.5 now has a ToXml() method that prints the XML representation of the Person object (see Listing 20.7).
Listing 20.7 Printing the State of an Object in XML (C#)
using System;
using System.Text;
using System.IO;
using System.Xml;
using System.Collections;

public class Person
{
    Hashtable attributes;

    public Person ()
    {
        attributes = new Hashtable();
        attributes.Add("id", null);
        attributes.Add("ssn", null);
        attributes.Add("FirstName", null);
        attributes.Add("LastName", null);
        attributes.Add("City", null);
        attributes.Add("State", null);
        attributes.Add("Street", null);
        attributes.Add("ZipCode", null);
        attributes.Add("Title", null);
        attributes.Add("Description", null);
    }

    public object GetID() { return attributes["id"]; }
    public object GetFirstName() { return attributes["FirstName"]; }
    public object GetLastName() { return attributes["LastName"]; }
    public object GetSSN() { return attributes["ssn"]; }
    public object GetCity() { return attributes["City"]; }
    public object GetState() { return attributes["State"]; }
    public object GetStreet() { return attributes["Street"]; }
    public object GetZip() { return attributes["ZipCode"]; }
    public object GetTitle() { return attributes["Title"]; }
    public object GetDescription() { return attributes["Description"]; }

    public void SetID(object o) { attributes["id"] = o; }
    public void SetFirstName(object o) { attributes["FirstName"] = o; }
    public void SetLastName(object o) { attributes["LastName"] = o; }
    public void SetSSN(object o) { attributes["ssn"] = o; }
    public void SetCity(object o) { attributes["City"] = o; }
    public void SetState(object o) { attributes["State"] = o; }
    public void SetStreet(object o) { attributes["Street"] = o; }
    public void SetZip(object o) { attributes["ZipCode"] = o; }
    public void SetTitle(object o) { attributes["Title"] = o; }
    public void SetDescription(object o) { attributes["Description"] = o; }

    public void ToXml()
    {
        XmlTextWriter w = new XmlTextWriter (Console.Out);
        w.Formatting = Formatting.Indented;

        w.WriteStartElement("Person");

        //Attributes are written without a namespace.
        w.WriteStartAttribute("id", "");
        w.WriteString((string)GetID());
        w.WriteEndAttribute();
        w.WriteStartAttribute("ssn", "");
        w.WriteString((string)GetSSN());
        w.WriteEndAttribute();

        w.WriteStartElement("Name");
        w.WriteElementString("FirstName", (string)GetFirstName());
        w.WriteElementString("LastName", (string)GetLastName());
        w.WriteEndElement();

        w.WriteStartElement("Address");
        w.WriteElementString("Street", (string)GetStreet());
        w.WriteElementString("City", (string)GetCity());
        w.WriteElementString("State", (string)GetState());
        w.WriteElementString("ZipCode", (string)GetZip());
        w.WriteEndElement();

        w.WriteStartElement("Job");
        w.WriteElementString("Title", (string)GetTitle());
        w.WriteElementString("Description", (string)GetDescription());
        w.WriteEndElement();

        w.WriteEndElement();

        //Make sure everything written so far reaches the console.
        w.Flush();
    }

    public static void Main(string[] args)
    {
        Person p = new Person();
        p.SetID("3");
        p.ToXml();
    }
}
The output of Listing 20.7 is as follows:
<Person id="3" ssn="">
  <Name>
    <FirstName />
    <LastName />
  </Name>
  <Address>
    <Street />
    <City />
    <State />
    <ZipCode />
  </Address>
  <Job>
    <Title />
    <Description />
  </Job>
</Person>
In the ToXml method we start by creating an XmlTextWriter:
XmlTextWriter w = new XmlTextWriter (Console.Out);
The XmlTextWriter constructor is overloaded to write the XML to any stream (file, memory, string). In this case we chose to print it on the console.
Next, we set the formatting to Indented so that output is formatted properly:
w.Formatting = Formatting.Indented;
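The rest of the ToXml method consists of calling the appropriate Write methods: WriteStartElement and WriteEndElement to open and close each element, WriteStartAttribute, WriteString, and WriteEndAttribute to write each attribute, and WriteElementString to write each text-only element.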