Building a SAX-Style Push Model Using the XmlTextReader
If you've worked with XML much in the past, you're probably familiar with the Simple API for XML (SAX) that relies on an event-based push model. SAX has become popular for parsing XML documents in a fast and efficient manner.
Creating a SAX parser provides an excellent stage for displaying just how powerful the XmlTextReader can be in .NET applications. Through seeing the process of creating the SAX parser, you'll become more familiar with different ways that the XmlTextReader can be used and see a few other things you can do with .NET classes.
If you've used SAX in the past, you're probably intimately familiar with the different parts of the SAX parser ContentHandler interface. Listing 1 shows this SAX interface converted to a .NET interface.
Listing 1 The IContentHandler Interface (IContentHandler.cs)
1: using XmlParsers.Sax.Helpers;
2: interface IContentHandler {
3:
4:   void setDocumentLocator(Locator locator);
5:
6:   void startDocument();
7:
8:   void endDocument();
9:
10:   void processingInstruction(string target, string data);
11:
12:   void startPrefixMapping(string prefix, string uri);
13:
14:   void endPrefixMapping(string prefix);
15:
16:   void startElement(string namespaceURI, string localName,
17:     string rawName, Attributes atts);
18:
19:   void endElement(string namespaceURI, string localName, string rawName);
20:
21:   void characters(char[] ch, int start, int end);
22:
23:   void ignorableWhitespace(char[] ch, int start, int end);
24:
25:   void skippedEntity(string name);
26: }
If you're not familiar with this interface, it's actually quite simple to understand. The different interface members are called by the SAX parser as it parses an XML document. For example, when the document parsing starts, startDocument() is called. When an element node is found within the document, startElement() is called and any attributes associated with the element are passed as arguments (whether you want them or not). This push process continues until the end of the document is reached, at which point endDocument() is called.
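To make the push order concrete, here is a minimal sketch of a handler that records each callback as it arrives. The IMiniContentHandler interface, the RecordingHandler class, and the PushOrderDemo driver are hypothetical names invented for this illustration. The interface is a trimmed-down stand-in for IContentHandler, and the driver calls the handler by hand in the order a parser would for a one-element document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, trimmed-down stand-in for IContentHandler (illustration only).
interface IMiniContentHandler {
    void startDocument();
    void startElement(string localName);
    void characters(char[] ch, int start, int length);
    void endElement(string localName);
    void endDocument();
}

// Records every callback in the order it is pushed.
class RecordingHandler : IMiniContentHandler {
    public readonly List<string> Events = new List<string>();

    public void startDocument() { Events.Add("startDocument"); }
    public void startElement(string localName) { Events.Add("startElement:" + localName); }
    public void characters(char[] ch, int start, int length) {
        Events.Add("characters:" + new string(ch, start, length));
    }
    public void endElement(string localName) { Events.Add("endElement:" + localName); }
    public void endDocument() { Events.Add("endDocument"); }
}

static class PushOrderDemo {
    // Simulates the calls a parser would push for <greeting>hi</greeting>.
    public static List<string> Run() {
        RecordingHandler handler = new RecordingHandler();
        char[] text = "hi".ToCharArray();
        handler.startDocument();
        handler.startElement("greeting");
        handler.characters(text, 0, text.Length);
        handler.endElement("greeting");
        handler.endDocument();
        return handler.Events;
    }

    static void Main() {
        foreach (string e in Run()) {
            Console.WriteLine(e);
        }
    }
}
```

The handler never asks the parser for anything. It only reacts to what it is handed, which is the defining trait of the push model.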
A few classes are required by SAX that are not built in to the .NET platform. Fortunately, creating custom classes to emulate the classes that the SAX ContentHandler expects is very straightforward. Listing 2 shows some helper classes and a struct that were created to enable SAX-like functionality in the IContentHandler interface (and in the IErrorHandler interface, which is not shown here). More specifically, the code defines the SAXParseException, Attributes, and Locator classes. The Attributes class contains a collection of attributes that will be passed as an argument to startElement() in the ContentHandler. The Locator class allows the line number and column number of the SAX parser's current position to be tracked as it parses the XML document.
Listing 2 Creating SAX Helpers (SAXHelpers.cs)
1: namespace XmlParsers.Sax.Helpers {
2:   using System;
3:   using System.Collections;
4:   public class SAXParseException {
5:     string lineNumber = "";
6:     string systemID = "";
7:     string message = "";
8:     public string getLineNumber() {
9:       return lineNumber;
10:     }
11:     public string getSystemID() {
12:       return systemID;
13:     }
14:     public string getMessage() {
15:       return message;
16:     }
17:     public string LineNumber {
18:       set {
19:         lineNumber = value;
20:       }
21:     }
22:     public string SystemID {
23:       set {
24:         systemID = value;
25:       }
26:     }
27:     public string Message {
28:       set {
29:         message = value;
30:       }
31:     }
32:   }
33:
34:   public class Locator {
35:     int lineNumber;
36:     int columnNumber;
37:     public Locator() {
38:       lineNumber = 0;
39:       columnNumber = 0;
40:     }
41:     public int LineNumber {
42:       set {
43:         lineNumber = value;
44:       }
45:     }
46:     public int ColumnNumber {
47:       set {
48:         columnNumber = value;
49:       }
50:     }
51:     public int getLineNumber() {
52:       return lineNumber;
53:     }
54:     public int getColumnNumber() {
55:       return columnNumber;
56:     }
57:   }
58:
59:   public struct SaxAttribute {
60:     public string Name;
61:     public string NamespaceURI;
62:     public string Value;
63:   }
64:
65:   public class Attributes {
66:     public ArrayList attArray = new ArrayList();
67:     public int getLength() {
68:       return attArray.Count;
69:     }
70:     public string getQName(int index) {
71:       SaxAttribute saxAtt = (SaxAttribute)attArray[index];
72:       return saxAtt.Name;
73:     }
74:     public string getValue(int index) {
75:       SaxAttribute saxAtt = (SaxAttribute)attArray[index];
76:       return saxAtt.Value;
77:     }
78:     public Attributes TrimArray() {
79:       attArray.TrimToSize();
80:       return this;
81:     }
82:   }
83: }
After all the classes expected by the SAX ContentHandler and ErrorHandler interfaces are created and ready to be used, the actual implementation of the SAX parser can be constructed through using the XmlTextReader class, along with a few others. Listing 3 shows this implementation and is followed by step-by-step details on what the code is doing.
Listing 3 Constructing a SAX Parser Using the XmlTextReader Class (SAXParser.cs)
1: namespace XmlParsers.Sax {
2:   using System;
3:   using System.Xml;
4:   using System.Collections;
5:   using XmlParsers.Sax.Handlers;
6:   using XmlParsers.Sax.Helpers;
7:   /// <summary>
8:   /// The SaxParser class builds a SAX push model
9:   /// from the pull model found in the XmlTextReader.
10:   /// </summary>
11:   public class SaxParser {
12:     private ContentHandler Handler = null;
13:     private ErrorHandler errorHandler = null;
14:     public void setErrorHandler(ErrorHandler errorHandler) {
15:       this.errorHandler = errorHandler;
16:     }
17:     public void setContentHandler(ContentHandler handler) {
18:       this.Handler = handler;
19:     }
20:     public void parse(string url) {
21:       int buflen = 500;
22:       char[] buffer = new char[buflen];
23:       Stack nsstack = new Stack();
24:       Locator locator = new Locator();
25:       SAXParseException saxException = new SAXParseException();
26:       Attributes atts;
27:       XmlTextReader reader = null;
28:       try {
29:         reader = new XmlTextReader(url);
30:         object nsuri = reader.NameTable.Add(
31:           "http://www.w3.org/2000/xmlns/");
32:         Handler.startDocument();
33:         while (reader.Read()) {
34:           int len;
35:           string prefix;
36:           locator.LineNumber = reader.LineNumber;
37:           locator.ColumnNumber = reader.LinePosition;
38:           Handler.setDocumentLocator(locator);
39:           switch (reader.NodeType) {
40:             case XmlNodeType.Element:
41:               nsstack.Push(null);//marker
42:               atts = new Attributes();
43:               while (reader.MoveToNextAttribute()) {
44:                 if (reader.NamespaceURI.Equals(nsuri)) {
45:                   prefix = "";
46:                   if (reader.Prefix == "xmlns") {
47:                     prefix = reader.LocalName;
48:                   }
49:                   nsstack.Push(prefix);
50:                   Handler.startPrefixMapping(prefix,
51:                     reader.Value);
52:                 } else {
53:                   SaxAttribute newAtt =
54:                     new SaxAttribute();
55:                   newAtt.Name = reader.Name;
56:                   newAtt.NamespaceURI =
57:                     reader.NamespaceURI;
58:                   newAtt.Value = reader.Value;
59:                   atts.attArray.Add(newAtt);
60:                 }
61:               }
62:               reader.MoveToElement();
63:               Handler.startElement(reader.NamespaceURI,
64:                 reader.LocalName, reader.Name,
65:                 atts.TrimArray());
66:               if (reader.IsEmptyElement) {
67:                 Handler.endElement(reader.NamespaceURI,
68:                   reader.LocalName, reader.Name);
69:               }
70:               break;
71:             case XmlNodeType.EndElement:
72:               Handler.endElement(reader.NamespaceURI,
73:                 reader.LocalName, reader.Name);
74:               while (prefix != null) {
75:                 Handler.endPrefixMapping(prefix);
76:                 prefix = (string)nsstack.Pop();
77:               }
78:               break;
79:             case XmlNodeType.Text:
80:               while ((len =
81:                 reader.ReadChars(buffer, 0, buflen))>0) {
82:                 Handler.characters(buffer, 0, len);
83:               }
84:               //After read you are automatically put
85:               //on the next tag so you have to
86:               //call the proper case from here.
87:               if (reader.NodeType == XmlNodeType.Element) {
88:                 goto case XmlNodeType.Element;
89:               }
90:               if (reader.NodeType ==
91:                 XmlNodeType.EndElement) {
92:                 goto case XmlNodeType.EndElement;
93:               }
94:               break;
95:             case XmlNodeType.ProcessingInstruction:
96:               Handler.processingInstruction(reader.Name,
97:                 reader.Value);
98:               break;
99:             case XmlNodeType.Whitespace:
100:               char[] whiteSpace =
101:                 reader.Value.ToCharArray();
102:               Handler.ignorableWhitespace(whiteSpace,0,1);
103:               break;
104:             case XmlNodeType.Entity:
105:               Handler.skippedEntity(reader.Name);
106:               break;
107:           }
108:         } //While
109:         Handler.endDocument();
110:       } //try
111:       catch (Exception exception) {
112:         saxException.LineNumber = reader.LineNumber.ToString();
113:         saxException.SystemID = "";
114:         saxException.Message =
115:           exception.GetBaseException().ToString();
116:         errorHandler.error(saxException);
117:       }
118:       finally {
119:         if (reader.ReadState != ReadState.Closed) {
120:           reader.Close();
121:         }
122:       }
123:     } //parse()
124:   } //SAXParser
125: } //namespace
Let's take a step-by-step look at what is happening in the code.
Step 1: Referencing Assemblies
The namespace XmlParsers.Sax is declared and several namespaces are made available to the SAXParser class, including the System.Xml namespace shown in line 3. The helper classes are also referenced (XmlParsers.Sax.Helpers), along with the necessary content handlers (XmlParsers.Sax.Handlers).
1: namespace XmlParsers.Sax {
2:   using System;
3:   using System.Xml;
4:   using System.Collections;
5:   using XmlParsers.Sax.Handlers;
6:   using XmlParsers.Sax.Helpers;
Step 2: Setting the Handlers
The setContentHandler() and setErrorHandler() methods let the SAXParser class know which objects to pass information to as the stream of XML is read or when an error occurs.
12: private ContentHandler Handler = null;
13: private ErrorHandler errorHandler = null;
14: public void setErrorHandler(ErrorHandler errorHandler) {
15:   this.errorHandler = errorHandler;
16: }
17: public void setContentHandler(ContentHandler handler) {
18:   this.Handler = handler;
19: }
Step 3: Declaring the XmlTextReader Class
After the ContentHandler and ErrorHandler classes are set, the parse() method of the SAXParser class can be called with the path to the XML document passed in as a parameter. Within the parse() method, several objects are created, including those necessary to handle buffering (lines 21-22), namespace prefixes (line 23), the ContentHandler Locator and Attributes objects (lines 24 and 26), and the ErrorHandler SAXParseException object (line 25). Finally, line 27 shows an XmlTextReader variable named reader being set to null.
20: public void parse(string url) {
21:   int buflen = 500;
22:   char[] buffer = new char[buflen];
23:   Stack nsstack = new Stack();
24:   Locator locator = new Locator();
25:   SAXParseException saxException = new SAXParseException();
26:   Attributes atts;
27:   XmlTextReader reader = null;
Step 4: Instantiating the XmlTextReader Class
The next section of code begins the try portion of a try/catch/finally block and then takes care of instantiating the XmlTextReader class. Notice that the path to the XML document that will be parsed is passed into the constructor. This path was originally passed into the parse() method as a string (shown in Step 3). Lines 30-31 show a string being added to the reader's XmlNameTable (exposed through the NameTable property) as well. In cases where you will do numerous comparisons between a particular string value and an XML token returned by the XmlTextReader during parsing, you'll want to consider using an XmlNameTable. The name table stores a single atomized copy of each string, so object comparisons can be done during parsing that offer performance benefits over regular string comparisons. In Step 6, you'll learn more about how the atomized string named nsuri (line 30) is used.
28: try {
29:   reader = new XmlTextReader(url);
30:   object nsuri = reader.NameTable.Add(
31:     "http://www.w3.org/2000/xmlns/");
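The idea behind the name table can be tried in isolation. The following is a minimal sketch, assuming a standalone NameTable instead of the one owned by a reader; the NameTableDemo class and SameAtom method are hypothetical names used only here. It adds the same text twice, once as a separately built string object, and checks that Add() hands back the same instance both times.

```csharp
using System;
using System.Xml;

static class NameTableDemo {
    public static bool SameAtom() {
        NameTable table = new NameTable();

        // Add() returns the atomized (canonical) string instance for this value.
        string first = table.Add("http://www.w3.org/2000/xmlns/");

        // Build an equal string at run time so that it is a distinct object.
        string copy = new string("http://www.w3.org/2000/xmlns/".ToCharArray());
        string second = table.Add(copy);

        // Both calls return the same object, so a reference check is enough,
        // even though 'copy' itself is a different object with equal text.
        return Object.ReferenceEquals(first, second)
            && !Object.ReferenceEquals(first, copy);
    }

    static void Main() {
        Console.WriteLine(SameAtom());
    }
}
```

Because equal strings resolve to one object in the table, code that holds an atomized reference can compare references instead of comparing characters.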
Step 5: Reading from the Stream
Assuming the XML document identified by the file path passed into the constructor of the XmlTextReader is found, the stream of XML tokens is now ready to be read. If the file is not found, the catch block will be hit, which will cause the ErrorHandler to be called (shown later). Before the stream reading begins, however, the ContentHandler's startDocument() method is called to let it know that parsing is about to begin (line 32). After the ContentHandler has been notified, the process of reading the stream is started by calling the XmlTextReader's Read() method from within a while statement, as shown in line 33. This method will return true as long as the end of the stream has not been reached.
Lines 36-38 show how information about the reader's position within the stream can be passed to the ContentHandler. After this information has been passed, the bulk of the parsing work begins as the reader's NodeType property is checked within a switch statement (line 39).
32: Handler.startDocument();
33: while (reader.Read()) {
34:   int len;
35:   string prefix;
36:   locator.LineNumber = reader.LineNumber;
37:   locator.ColumnNumber = reader.LinePosition;
38:   Handler.setDocumentLocator(locator);
39:   switch (reader.NodeType) {
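The read loop and the position properties can be exercised on their own. This is a minimal sketch, assuming the XML comes from an in-memory string instead of a file path; ReadLoopDemo and Walk are hypothetical names for this illustration. It records the node type, name, line number, and line position of every node that Read() lands on.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class ReadLoopDemo {
    // Walks every node in the XML string and describes where each was found.
    public static List<string> Walk(string xml) {
        List<string> seen = new List<string>();
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        try {
            // Read() returns false once the end of the stream is reached.
            while (reader.Read()) {
                seen.Add(String.Format("{0} '{1}' at line {2}, position {3}",
                    reader.NodeType, reader.Name,
                    reader.LineNumber, reader.LinePosition));
            }
        } finally {
            reader.Close();
        }
        return seen;
    }

    static void Main() {
        foreach (string line in Walk("<root>\n  <item>text</item>\n</root>")) {
            Console.WriteLine(line);
        }
    }
}
```

Each entry in the list corresponds to one pass through the while loop, which is the point at which the SAX wrapper updates its Locator before dispatching on the node type.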
Step 6: Checking for Element Nodes
Now that we're within the switch statement, each XML token read from the stream can be checked against the XmlNodeType enumeration. The case statement shown in line 40 will be hit if an element type node is found in the stream. Because element nodes are the main structural component of an XML document, the bulk of the parsing work will be done within this particular case.
Line 41 starts things off by pushing a null entry onto a Stack object named nsstack. The null acts as a marker that separates the namespace prefixes declared on one element from those declared on its ancestors. The stack tracks namespace prefixes found on a given element so that they can be reported through the ContentHandler's startPrefixMapping() method (shown in lines 45-51). Line 42 then creates an instance of the SAX helper class named Attributes, which holds any attributes found on the element. To enumerate the attributes, the XmlTextReader's MoveToNextAttribute() method is called as shown in line 43. This method returns true each time another attribute is found. Lines 53-59 take care of adding each attribute's Name, NamespaceURI, and Value to the Attributes collection. Notice that as the attributes are enumerated, the NamespaceURI property of the reader is compared to the atomized string named nsuri that was added to the reader's XmlNameTable (line 44). As mentioned earlier, comparing against an atomized object can perform better than a plain string comparison such as the following:
if (reader.NamespaceURI == "http://www.w3.org/2000/xmlns/") { ........ }
After all attributes have been enumerated (assuming some existed in the first place), the XmlTextReader's MoveToElement() method is called to get back to the original element. Now that the necessary information about the element has been gathered, the ContentHandler's startElement() method is called and the objects detailed earlier, along with information about the element itself, are passed in as arguments (lines 63-65). After this is completed, a check is made to see whether the element is empty (line 66). If it is, the endElement() method is called to let the ContentHandler know that the parser will be moving on to other nodes within the XML document.
40: case XmlNodeType.Element:
41:   nsstack.Push(null);//marker
42:   atts = new Attributes();
43:   while (reader.MoveToNextAttribute()) {
44:     if (reader.NamespaceURI.Equals(nsuri)) {
45:       prefix = "";
46:       if (reader.Prefix == "xmlns") {
47:         prefix = reader.LocalName;
48:       }
49:       nsstack.Push(prefix);
50:       Handler.startPrefixMapping(prefix,
51:         reader.Value);
52:     } else {
53:       SaxAttribute newAtt =
54:         new SaxAttribute();
55:       newAtt.Name = reader.Name;
56:       newAtt.NamespaceURI =
57:         reader.NamespaceURI;
58:       newAtt.Value = reader.Value;
59:       atts.attArray.Add(newAtt);
60:     }
61:   }
62:   reader.MoveToElement();
63:   Handler.startElement(reader.NamespaceURI,
64:     reader.LocalName, reader.Name,
65:     atts.TrimArray());
66:   if (reader.IsEmptyElement) {
67:     Handler.endElement(reader.NamespaceURI,
68:       reader.LocalName, reader.Name);
69:   }
70:   break;
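The attribute walk in this step can be tried separately from the rest of the parser. The sketch below assumes an in-memory XML string and uses hypothetical names (AttributeSplitDemo, Split). It uses a plain string comparison for brevity instead of the atomized comparison from the listing, and it sorts the attributes of the first element into namespace declarations and ordinary attributes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class AttributeSplitDemo {
    // Namespace URI that the reader reports for xmlns declarations.
    const string XmlnsUri = "http://www.w3.org/2000/xmlns/";

    // Sorts the attributes of the first element into prefix mappings
    // (xmlns declarations) and ordinary attributes.
    public static void Split(string xml, List<string> mappings, List<string> attributes) {
        XmlTextReader reader = new XmlTextReader(new StringReader(xml));
        try {
            reader.MoveToContent(); // position the reader on the root element
            while (reader.MoveToNextAttribute()) {
                if (reader.NamespaceURI == XmlnsUri) {
                    // xmlns:a="..." has the prefix "xmlns" and the local name "a";
                    // a default declaration (xmlns="...") maps the empty prefix.
                    string prefix = reader.Prefix == "xmlns" ? reader.LocalName : "";
                    mappings.Add(prefix + "=" + reader.Value);
                } else {
                    attributes.Add(reader.Name + "=" + reader.Value);
                }
            }
            reader.MoveToElement(); // return to the element that owns the attributes
        } finally {
            reader.Close();
        }
    }

    static void Main() {
        List<string> mappings = new List<string>();
        List<string> attributes = new List<string>();
        Split("<a:doc xmlns:a='urn:a' xmlns='urn:default' id='1'/>", mappings, attributes);
        Console.WriteLine("mappings: " + String.Join(", ", mappings.ToArray()));
        Console.WriteLine("attributes: " + String.Join(", ", attributes.ToArray()));
    }
}
```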
Step 7: Checking for End Element Nodes
The next case statement simply checks whether the current XML token being read from the stream is of type XmlNodeType.EndElement. As with empty elements, an end element that is found in the stream will cause the ContentHandler's endElement() method to be called so that it knows the current element has been completely processed. This section of code also takes care of popping entries off the nsstack stack and calling endPrefixMapping() (lines 74-77). Note that, as listed, the prefix variable is not assigned on this path before line 74 tests it, so the loop needs to be seeded with a call to nsstack.Pop() first.
71: case XmlNodeType.EndElement:
72:   Handler.endElement(reader.NamespaceURI,
73:     reader.LocalName, reader.Name);
74:   while (prefix != null) {
75:     Handler.endPrefixMapping(prefix);
76:     prefix = (string)nsstack.Pop();
77:   }
78:   break;
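The marker-based stack can be sketched without any XML at all. The following minimal example assumes a generic Stack<string> in place of the listing's non-generic Stack, and PrefixStackDemo, EndScope, and Run are hypothetical names. It seeds the loop with an initial Pop() and then unwinds the prefixes of one element at a time, stopping at that element's null marker.

```csharp
using System;
using System.Collections.Generic;

static class PrefixStackDemo {
    // Pops the prefixes declared on the element being closed,
    // stopping when that element's null marker is reached.
    public static List<string> EndScope(Stack<string> nsstack) {
        List<string> ended = new List<string>();
        string prefix = nsstack.Pop();      // seed the loop with the top entry
        while (prefix != null) {
            ended.Add(prefix);              // stands in for endPrefixMapping(prefix)
            prefix = nsstack.Pop();
        }
        return ended;
    }

    // Builds the stack for an outer element declaring "a" that contains
    // an inner element declaring "b" and "c", then closes both.
    public static string Run() {
        Stack<string> nsstack = new Stack<string>();
        nsstack.Push(null);   // marker for the outer element
        nsstack.Push("a");
        nsstack.Push(null);   // marker for the inner element
        nsstack.Push("b");
        nsstack.Push("c");

        string inner = String.Join(",", EndScope(nsstack).ToArray());
        string outer = String.Join(",", EndScope(nsstack).ToArray());
        return inner + "|" + outer;
    }

    static void Main() {
        Console.WriteLine(Run()); // prints "c,b|a"
    }
}
```

Each element contributes exactly one marker, so every end element must pop exactly one marker. A wrapper built this way would also need to pop the marker for empty elements, which never produce a separate end-element node.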
Step 8: Reading Text Nodes
Although the capability to handle element nodes (both start and end elements) is certainly useful, getting to the text contained within these elements is a necessity if the data is to be of any use. Fortunately, the XmlTextReader makes it easy to handle text nodes by using the ReadChars() method. Lines 80-83 show how this method is used. First, notice that the ReadChars() method accepts several arguments, including the buffer to write characters to (appropriately named buffer in this case), the position within the buffer to start writing to, and the number of characters to write into the buffer. If you remember back to Step 3 (lines 21 and 22), the size of the buffer along with the actual buffer itself were defined. In this example, buflen is set to 500. In cases where a text node is longer than 500 characters, the while loop will take care of filling the buffer repeatedly with up to 500 characters at a time until all characters within the text node have been handled.
After reading the characters, the XmlTextReader automatically places itself on the next XML token within the stream. This means that you do not explicitly have to tell it to move, because it does the moving for you. As a result of this behavior, lines 87-93 check which node type the reader is now positioned on and jump to the proper case statement for that node type. If these statements were omitted, several start or end elements could be skipped and left unprocessed.
79: case XmlNodeType.Text:
80:   while ((len =
81:     reader.ReadChars(buffer, 0, buflen))>0) {
82:     Handler.characters(buffer, 0, len);
83:   }
84:   //After read you are automatically put
85:   //on the next tag so you have to
86:   //call the proper case from here.
87:   if (reader.NodeType == XmlNodeType.Element) {
88:     goto case XmlNodeType.Element;
89:   }
90:   if (reader.NodeType ==
91:     XmlNodeType.EndElement) {
92:     goto case XmlNodeType.EndElement;
93:   }
94:   break;
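The buffering loop follows a common pattern that is easy to see in isolation. The sketch below is an illustration only: it uses a StringReader as a stand-in for the XML reader, so it shows the fill-and-forward loop and not ReadChars() itself. ChunkedReadDemo and ReadInChunks are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class ChunkedReadDemo {
    // Delivers the text in pieces of at most buflen characters.
    public static List<string> ReadInChunks(string text, int buflen) {
        List<string> chunks = new List<string>();
        char[] buffer = new char[buflen];
        StringReader source = new StringReader(text); // stand-in for the XML reader
        int len;

        // Read() returns how many characters were placed in the buffer,
        // and 0 once the source is exhausted.
        while ((len = source.Read(buffer, 0, buflen)) > 0) {
            // Only the first 'len' characters of the buffer are valid here;
            // this line stands in for Handler.characters(buffer, 0, len).
            chunks.Add(new string(buffer, 0, len));
        }
        return chunks;
    }

    static void Main() {
        foreach (string chunk in ReadInChunks("abcdefghij", 4)) {
            Console.WriteLine(chunk); // prints "abcd", "efgh", then "ij"
        }
    }
}
```

The handler must honor the length it is given, because the final pass usually fills only part of the buffer and the rest still holds characters from the previous pass.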
Step 9: Handling Processing Instructions, Whitespace, and Entities
As processing instruction, whitespace, and entity node types are encountered during the parsing process, each is passed to the appropriate method within the ContentHandler, as shown in the following code. As you look through the code, you'll see that this process is fairly straightforward.
95: case XmlNodeType.ProcessingInstruction:
96:   Handler.processingInstruction(reader.Name,
97:     reader.Value);
98:   break;
99: case XmlNodeType.Whitespace:
100:   char[] whiteSpace =
101:     reader.Value.ToCharArray();
102:   Handler.ignorableWhitespace(whiteSpace,0,1);
103:   break;
104: case XmlNodeType.Entity:
105:   Handler.skippedEntity(reader.Name);
106:   break;
Step 10: Ending the Parsing Process and Catching Errors
If the XML document is parsed completely by the XmlTextReader, the endDocument() method is called on the ContentHandler (line 109). If an error arises during the parsing process, such as the XML document not being found or not being well-formed, the error will be caught by the catch block shown in lines 111-117. This block takes care of calling the ErrorHandler object so that the error can be reported to the SAX application.
After the parsing process has completed, the finally block will be called. This block checks the ReadState property and closes the XmlTextReader if it is not already closed.
109:   Handler.endDocument();
110: } //try
111: catch (Exception exception) {
112:   saxException.LineNumber = reader.LineNumber.ToString();
113:   saxException.SystemID = "";
114:   saxException.Message =
115:     exception.GetBaseException().ToString();
116:   errorHandler.error(saxException);
117: }
118: finally {
119:   if (reader.ReadState != ReadState.Closed) {
120:     reader.Close();
121:   }
122: }
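The same try/catch/finally shape can be sketched in a self-contained form. This example assumes an in-memory XML string and uses hypothetical names (SafeParseDemo, TryParse). It catches XmlException specifically so that the line number can be taken from the exception, and it guards the cleanup with a null check in case the reader was never constructed.

```csharp
using System;
using System.IO;
using System.Xml;

static class SafeParseDemo {
    // Returns null when the XML parses cleanly,
    // or a short description of the first error otherwise.
    public static string TryParse(string xml) {
        XmlTextReader reader = null;
        try {
            reader = new XmlTextReader(new StringReader(xml));
            while (reader.Read()) {
                // A real parser would dispatch on reader.NodeType here.
            }
            return null;
        } catch (XmlException ex) {
            // XmlException carries the position of the error itself.
            return String.Format("line {0}: {1}", ex.LineNumber, ex.Message);
        } finally {
            // Guard against a reader that was never created.
            if (reader != null) {
                reader.Close();
            }
        }
    }

    static void Main() {
        Console.WriteLine(TryParse("<a><b></b></a>") == null ? "ok" : "failed");
        Console.WriteLine(TryParse("<a><b></a>"));
    }
}
```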
Step 11: Calling the SAX Parser from an ASP.NET Page
After reading through all the code explanations in the previous 10 steps, you should have a good idea of what features the XmlTextReader class has to offer. The only thing left to show is how an ASP.NET page can instantiate and use the SAXParser class and its associated handler classes to parse an XML document. Listing 4 shows the segment of code from a file named SAXTest.aspx that instantiates the SAXParser class.
Listing 4 Using the SAXParser Class from an ASP.NET Page
1: public void StartSAX() {
2:   Response.Write("<b>Starting SAX Parsing....</b><p />");
3:   SaxParser parser = new SaxParser();
4:   ContentHandler handler = new ContentHandler(Request,Response);
5:   ErrorHandler errorHandler = new ErrorHandler(Request,Response);
6:   parser.setContentHandler(handler);
7:   parser.setErrorHandler(errorHandler);
8:   try {
9:     parser.parse(Server.MapPath("SAXTest.xml"));
10:   }
11:   catch (Exception exp) {
12:     Response.Write(exp.ToString());
13:   }
14: }
The ContentHandler class in this example receives data about different nodes in the XML document and then writes the results out to the browser. However, the same code (with minor modifications) could be used to update a database, text file, or other data store. In concluding the discussion on creating a .NET version of the SAX parser by using the XmlTextReader class, keep in mind that in most cases you should use the XmlTextReader directly in your ASP.NET applications rather than going through a SAX-style wrapper, because it is engineered to provide fast and efficient XML document parsing.