18.4 MSXML Fundamentals
DOM and SAX were mentioned in Chapter 2. DOM Level 2 and SAX2 are both standard APIs for parsing XML, and MSXML supports both standards. In fact, it is possible to use both the DOM and SAX with MSXML to validate XML instances.
18.4.1 Using MSXML from Visual Basic
The first step in using MSXML in Visual Basic is referencing the MSXML components. As with any other COM component you use from Visual Basic, you add a reference to MSXML from your project. Inside the Visual Basic IDE, you use the Project menu to add or remove components of the project. One of these project components is a reference to a library. Selecting the Project menu and then the References item from that menu launches the References dialog box. This dialog box provides a checklist of all COM type libraries on your machine. Checking one of these libraries, such as MSXML 4.0, includes the type information about those components in your project and enables you to create instances of those components and refer to them through early binding. Figure 18.2 shows the References dialog box with MSXML 4.0 loaded.
FIGURE 18.2 Referencing MSXML 4.0 from Visual Basic.
Throughout the chapter, it is assumed that the Visual Basic examples are included in a project that has referenced MSXML 4.0. Any of the stand-alone code samples listed will function in any Visual Basic project that has referenced the MSXML 4.0 components.
18.4.2 Using the DOM
When you are using the MSXML implementation of the DOM Level 2 feature set, the component that represents the DOM Document node is a DOMDocument40 component. Through DOMDocument40, we have complete access to the contents of an XML document.
18.4.2.1 DOMDocument40
DOMDocument40 is the starting point for using the MSXML DOM implementation. The '40' suffix of the component indicates its version. Earlier versions of MSXML included DOMDocument components, and it is possible to declare and use DOMDocument components if you are not concerned with the version of the component you will be using. Because the behavior and feature set of the DOMDocument component have grown with each version, it is often necessary to reference a specific version to guarantee that the functionality you need is available.
To work with a DOMDocument40 component, you need to create an instance of DOMDocument40. Listing 18.1 shows the code required in Visual Basic to create an instance of a DOMDocument40 component if the reference to the MSXML 4.0 library has been added.
LISTING 18.1 Creating a DOMDocument40 (early binding)
    Dim doc As DOMDocument40
    Set doc = New DOMDocument40
If you are working from a scripting language or other language that only provides for late binding to COM objects, you can create the DOMDocument40 instance by using its PROGID, its user-friendly object identifier. The PROGID for the DOMDocument40 component is 'MSXML2.DOMDocument.4.0'. An example of late binding in VBScript is shown in Listing 18.2.
LISTING 18.2 Creating a DOMDocument40 (late binding for scripting)
    Dim doc
    Set doc = CreateObject("MSXML2.DOMDocument.4.0")
For the purposes of this chapter, assume that we are working inside the Visual Basic or similar IDE and have access to early binding to the MSXML components. This is just to simplify the sample code. The only difference between the two sets of code would be the instantiation code shown in Listings 18.1 and 18.2. As with any COM object, you gain performance at runtime through the use of early binding to the MSXML library.
PROGIDs can be used to create instances of any of the MSXML components. If you need to create an object instance from a scripting language such as VBScript or ECMAScript, use the PROGID of the object. The PROGIDs will follow the same form:
MSXML2.<object name>.4.0
For example, the XMLSchemaCache40 object has a PROGID of 'MSXML2.XMLSchemaCache.4.0'.
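The PROGID pattern above can be sketched for several components at once. This is a minimal illustration that assumes MSXML 4.0 is installed and registered on the machine. The procedure name CreateLateBound is hypothetical, and the variables are declared As Object so that the code relies only on late binding.

```vb
' Late-bound creation of MSXML 4.0 components by PROGID.
' CreateLateBound is a hypothetical procedure name used for illustration.
Public Sub CreateLateBound()
    Dim doc As Object
    Dim reader As Object
    Dim cache As Object

    ' Each PROGID follows the form MSXML2.<object name>.4.0
    Set doc = CreateObject("MSXML2.DOMDocument.4.0")
    Set reader = CreateObject("MSXML2.SAXXMLReader.4.0")
    Set cache = CreateObject("MSXML2.XMLSchemaCache.4.0")

    ' TypeName reports the runtime type of each late-bound object.
    Debug.Print TypeName(doc), TypeName(reader), TypeName(cache)
End Sub
```

Because the variables are typed As Object, this procedure compiles even in a project that has no reference to the MSXML 4.0 library. Member names are then resolved only at run time.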
18.4.2.2 Reading XML with DOMDocument40
Using DOMDocument40, we can load an XML document and examine its contents after they are loaded into the DOM tree. Two methods of the DOMDocument40 can be used to load an XML document: load and loadXML. The load method, shown in Listing 18.3, accepts a URL or file path that points to an XML document; it can also read from an object such as an IStream. Alternatively, the loadXML method accepts the XML content directly as a string.
LISTING 18.3 Loading an XML Document Using DOMDocument40
    Dim doc As DOMDocument40
    Set doc = New DOMDocument40
    doc.async = False
    doc.validateOnParse = False
    If (doc.Load("c:\test\address.xml")) Then
        MsgBox "Document is well-formed"
    Else
        Dim docError As IXMLDOMParseError
        Set docError = doc.parseError
        MsgBox docError.reason, vbCritical
    End If
Listing 18.3 shows two properties of the DOMDocument40 component that greatly impact the parsing behavior. The first is the async property, which is a Boolean flag indicating whether the document should be loaded synchronously or asynchronously. In the preceding code, we have set it to FALSE so the code will block on the Load method until the document is fully parsed or an error occurs. The default value for this property is TRUE, in which case the application must wait until the DOMDocument40 fires an onreadystatechange event or until the readyState property has changed to indicate that loading has completed.
The second property is validateOnParse, which is also a Boolean flag. The validateOnParse property defines the behavior you might expect: when TRUE, the parser attempts to validate the document against any associated DTD, XDR schema, or XML schema. When FALSE, the parser only verifies that the document is well-formed XML. The default value for this property is also TRUE.
The async and validateOnParse properties are not part of the DOM Level 2 feature set; they are specific to the MSXML implementation.
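An asynchronous load, as described above for the async property, might be sketched as follows. This is an illustration rather than a definitive pattern. It assumes the code lives in a form or class module (WithEvents is not allowed in a standard module), it reuses the file path from Listing 18.3, and BeginLoad is a hypothetical procedure name. The value 4 is the readyState value that indicates the load has completed.

```vb
' Form or class module: asynchronous load with an event handler.
Option Explicit

Private WithEvents doc As DOMDocument40

' BeginLoad is a hypothetical entry point used for illustration.
Public Sub BeginLoad()
    Set doc = New DOMDocument40
    doc.async = True                 ' the default, set here for clarity
    doc.Load "c:\test\address.xml"   ' returns before parsing has finished
End Sub

Private Sub doc_onreadystatechange()
    ' readyState 4 means loading has completed, successfully or not.
    If doc.readyState = 4 Then
        If doc.parseError.errorCode = 0 Then
            Debug.Print "Loaded: " & doc.documentElement.nodeName
        Else
            Debug.Print "Load failed: " & doc.parseError.reason
        End If
    End If
End Sub
```

The document variable is held at module level so that the object stays alive until the event fires.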
Once we have a DOM tree that can be traversed, which occurs after a successful load, we can use other MSXML components to perform a traversal.
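A simple traversal of a loaded tree can be sketched as shown here. The example uses loadXML with an inline string so that it does not depend on any file. The XML content and the procedure name ListChildElements are hypothetical and chosen only for illustration.

```vb
' Load a small document from a string and list the child elements
' of the document element.
Public Sub ListChildElements()
    Dim doc As DOMDocument40
    Dim node As IXMLDOMNode

    Set doc = New DOMDocument40
    doc.async = False

    ' Hypothetical inline content; doubled quotes embed a quote in a VB string.
    If doc.loadXML("<customers>" & _
                   "<businessCustomer id=""1""/>" & _
                   "<businessCustomer id=""2""/>" & _
                   "</customers>") Then
        For Each node In doc.documentElement.childNodes
            ' Skip text, comment, and other non-element nodes.
            If node.nodeType = NODE_ELEMENT Then
                Debug.Print node.nodeName
            End If
        Next node
    Else
        Debug.Print doc.parseError.reason
    End If
End Sub
```

With this content, the loop prints the name businessCustomer once for each of the two child elements.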
18.4.2.3 DOM Parsing Errors
Whenever the DOMDocument40 is used to parse an XML document, there is always a chance of error. The variety of errors that could be experienced during parsing varies from the obvious (not well-formed XML) to the complicated (the document did not conform to one of the associated XML schemas). To understand what went wrong during the parsing process and try to rectify it, we need to check the error generated by the parser.
The DOMDocument interface provides access to error information after parsing. Error information is provided through another, separate interface called IXMLDOMParseError. This interface returns information about the error type, the reason, and the location in the document where the error occurred. If validation were to fail, this information would be provided through the parseError property of the DOMDocument interface. The sample code that reads an XML document using DOMDocument40 accesses the parseError property to determine whether or not the document was loaded successfully.
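The error type, reason, and location mentioned above map onto individual properties of IXMLDOMParseError. The following sketch prints the most commonly used ones. The procedure name ReportParseError and its path parameter are hypothetical.

```vb
' Load a document and print the details of any parse error.
' ReportParseError is a hypothetical helper used for illustration.
Public Sub ReportParseError(ByVal path As String)
    Dim doc As DOMDocument40
    Dim parseErr As IXMLDOMParseError

    Set doc = New DOMDocument40
    doc.async = False

    If Not doc.Load(path) Then
        Set parseErr = doc.parseError
        Debug.Print "Code:   0x" & Hex$(parseErr.errorCode)
        Debug.Print "Reason: " & parseErr.reason
        Debug.Print "Source: " & parseErr.url
        Debug.Print "Line:   " & parseErr.Line & ", column " & parseErr.linepos
        Debug.Print "Text:   " & parseErr.srcText
    End If
End Sub
```

An errorCode of 0 means no error occurred, so the same property can serve as a success check after an asynchronous load.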
18.4.3 Using SAX2
The SAX and DOM parsers take a very different approach to working with an XML document. When working with the DOM, we load the XML document using a single object and then examine the tree that is built from the document contents. When working with SAX, we are notified through events whenever the parser encounters a particular element or a particular action occurs. Applications working with the SAX parser must implement handlers for each of these events and connect them to the parser.
Because MSXML is a COM-based API, the notification that comes from the MSXML SAX implementation comes through COM. This means that to create a handler, we must implement the COM interfaces that MSXML expects a SAX handler to implement.
Table 18.1 lists the six handler interfaces currently supported by the MSXML 4.0 SAX implementation, with the types of notifications they receive.
TABLE 18.1 SAX Handler Interfaces
Interface | Type of Notification
IVBSAXContentHandler | Document and elements
IVBSAXDeclHandler | DTD declarations
IVBSAXDTDHandler | DTD-related events
IVBSAXErrorHandler | Errors and warnings
IVBSAXLexicalHandler | Comments and CDATA
IMXSchemaDeclHandler | XML schema declarations
Any handler component can implement as many of these interfaces as needed, and that handler will receive notification about all events related to that category.
18.4.3.1 SAXXMLReader40
The SAXXMLReader40 component is responsible for parsing an XML document and triggering the notifications to the SAX handlers that have registered with it. Triggering the parsing of an XML document takes about as much code as using the DOM and DOMDocument40. The major difference is that in addition to this code, the application must also create any handlers it needs.
18.4.3.2 Reading XML with SAXXMLReader40
The most basic use of the SAX handler would be to sink the events related to content, so a handler interested in those events would need to implement the IVBSAXContentHandler interface. After that handler instance exists, it can be attached to the SAXXMLReader40 component and the parsing can occur.
Listing 18.4 defines a Visual Basic class whose instances function as content handlers. This class is defined as SAXContent, and it implements the IVBSAXContentHandler interface. Whenever an element is parsed, the class checks the local name and conditionally outputs the qualified name of the element by using the Debug.Print statement, the equivalent of a trace statement in other languages.
The code for the handler has a number of empty methods; in fact, it is almost completely empty. The methods are necessary, however, because to implement an interface, you must implement all its methods, even if they are just stubbed out.
LISTING 18.4 Content Handler Class
    'SAXContent.cls
    Implements IVBSAXContentHandler

    Private Sub IVBSAXContentHandler_characters( _
            strChars As String)
    End Sub

    Private Property Set IVBSAXContentHandler_documentLocator( _
            ByVal RHS As MSXML2.IVBSAXLocator)
    End Property

    Private Sub IVBSAXContentHandler_endDocument()
    End Sub

    Private Sub IVBSAXContentHandler_endElement( _
            strNamespaceURI As String, strLocalName As String, _
            strQName As String)
    End Sub

    Private Sub IVBSAXContentHandler_endPrefixMapping( _
            strPrefix As String)
    End Sub

    Private Sub IVBSAXContentHandler_ignorableWhitespace( _
            strChars As String)
    End Sub

    Private Sub IVBSAXContentHandler_processingInstruction( _
            strTarget As String, strData As String)
    End Sub

    Private Sub IVBSAXContentHandler_skippedEntity( _
            strName As String)
    End Sub

    Private Sub IVBSAXContentHandler_startDocument()
    End Sub

    Private Sub IVBSAXContentHandler_startElement( _
            strNamespaceURI As String, strLocalName As String, _
            strQName As String, _
            ByVal oAttributes As MSXML2.IVBSAXAttributes)
        ' Report only the elements we are interested in.
        If strLocalName = "businessCustomer" Then
            Debug.Print "element found: " & strQName
        End If
    End Sub

    Private Sub IVBSAXContentHandler_startPrefixMapping( _
            strPrefix As String, strURI As String)
    End Sub
To use the handler, you need to create an instance of the reader. The SAXXMLReader40, when instantiated, is used to parse the document. The handlers are attached to the reader, using the appropriate property. Several properties of the SAXXMLReader40 are used to attach handlers, as listed in Table 18.2.
TABLE 18.2 SAXXMLReader40 Handler Properties
Property | Related Handler Interface
contentHandler | IVBSAXContentHandler
dtdHandler | IVBSAXDTDHandler
errorHandler | IVBSAXErrorHandler
The code in Listing 18.5 shows how to use your handler class, SAXContent, and SAXXMLReader40 to read the document. The parseURL method of the reader accepts a path to a local XML document file just as the load method of the DOMDocument40 did in Listing 18.3.
LISTING 18.5 Using SAXXMLReader40 with the Content Handler
    Dim sax As SAXXMLReader40
    Set sax = New SAXXMLReader40
    Set sax.contentHandler = New SAXContent
    sax.parseURL "c:\temp\address.xml"
Unlike with the DOM, you do not need to traverse a tree to find what you are looking for. The call to parseURL begins the parsing; after that, it is up to the event handlers to get the information they need. The SAXXMLReader40 component informs you via a callback to your handler interface whenever a particular event occurs. When loading the thematic address.xml document, the output generated by the code appears as shown in Listing 18.6.
LISTING 18.6 Debug Output of the Content Handler from address.xml
    element found: businessCustomer
    element found: businessCustomer
    element found: businessCustomer
    element found: businessCustomer
18.4.3.3 SAXXMLReader40 Configuration
In SAX, there are defined procedures for modifying the XMLReader component. These methods allow the application to set two types of values of the XMLReader: properties and features. Properties are named values that can be set on the reader, and features are Boolean properties. MSXML follows the SAX standard and implements these methods for the SAXXMLReader40.
The XMLReader properties and features are accessed by four methods: getFeature, putFeature, getProperty, and putProperty. SAX defines a set of standard features and properties to be implemented, but these are not the only ones that can be used. With this approach, a particular SAX implementation can define its own, possibly proprietary, properties and features, and the standard can grow over time without a change to the interface for every new configuration setting. For example, handlers for content or errors have fixed COM properties that are part of the reader, whereas declaration handlers must be attached through the getProperty and putProperty methods. Listing 18.7 illustrates how to set the declaration handler of the SAXXMLReader40 using the putProperty method.
LISTING 18.7 Using putProperty for XMLReader Configuration
    Dim sax As SAXXMLReader40
    Set sax = New SAXXMLReader40
    ' Assume we have a declHandler that implements
    ' IVBSAXDeclHandler
    sax.putProperty _
        "http://xml.org/sax/properties/declaration-handler", _
        declHandler
Whenever you use the SAXXMLReader40 to validate against a specific XML schema, you must use the configuration methods discussed here.
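As a rough sketch of that configuration, the following code enables schema validation on the reader. It assumes the MSXML-specific feature name "schema-validation" and property name "schemas"; check both against the MSXML SDK documentation for your version. The namespace URI, the schema path, and the procedure name ValidateWithSax are hypothetical. The SAXContent class is the one from Listing 18.4.

```vb
' Configure a SAX reader to validate against a cached XML schema.
' ValidateWithSax is a hypothetical procedure name.
Public Sub ValidateWithSax()
    Dim sax As SAXXMLReader40
    Dim schemas As XMLSchemaCache40

    Set sax = New SAXXMLReader40
    Set schemas = New XMLSchemaCache40

    ' Hypothetical target namespace and schema location.
    schemas.Add "urn:example:address", "c:\temp\address.xsd"

    ' Features are Boolean; this one is part of the SAX2 standard.
    Debug.Print "namespaces: " & _
        sax.getFeature("http://xml.org/sax/features/namespaces")

    ' Assumed MSXML-specific names for schema validation.
    sax.putFeature "schema-validation", True
    sax.putProperty "schemas", schemas

    Set sax.contentHandler = New SAXContent
    sax.parseURL "c:\temp\address.xml"
End Sub
```

No error handling is shown here. A validation failure is reported through the error handler interface and is also expected to surface as a run-time error from parseURL.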
18.4.3.4 SAX2 Parsing Errors
To handle parsing errors in a SAX application, the application must implement a handler for the errors raised by the SAX parser. As stated earlier, MSXML SAX error handlers implement the IVBSAXErrorHandler interface. If your application needs to know when the validation of an XML document fails, a component must implement IVBSAXErrorHandler and include code in its fatalError method to respond to the failed validation. The XML Schema Tree example at the end of this chapter uses SAX2 and the error handler interface to validate an XML document against an XML schema.
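A minimal error handler class might look like the following sketch. The class name SAXErrors is hypothetical. The procedure signatures are assumed to match the stubs that the Visual Basic IDE generates for the Implements statement, so it is safer to let the IDE generate them than to copy them by hand.

```vb
'SAXErrors.cls (hypothetical class name)
Option Explicit
Implements IVBSAXErrorHandler

Private Sub IVBSAXErrorHandler_error( _
        ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, _
        ByVal nErrorCode As Long)
End Sub

Private Sub IVBSAXErrorHandler_fatalError( _
        ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, _
        ByVal nErrorCode As Long)
    ' Report where parsing or validation stopped and why.
    Debug.Print "Parse failed at line " & oLocator.lineNumber & _
        ", column " & oLocator.columnNumber & ": " & strErrorMessage
End Sub

Private Sub IVBSAXErrorHandler_ignorableWarning( _
        ByVal oLocator As MSXML2.IVBSAXLocator, _
        strErrorMessage As String, _
        ByVal nErrorCode As Long)
End Sub
```

An instance is attached through the errorHandler property listed in Table 18.2, for example with Set sax.errorHandler = New SAXErrors before the call to parseURL. The parseURL call is also expected to raise a Visual Basic run-time error when parsing fails, so the calling code would normally use On Error handling as well.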