Processing XML with Java: Reading XML
Writing XML documents is very straightforward, as I hope Chapters 3 and 4 proved. Reading XML documents is not nearly as simple. Fortunately, you don't have to do all the work yourself; you can use an XML parser to read the document for you. The XML parser exposes the contents of an XML document through an API, which a client application then reads. In addition to reading the document and providing the contents to the client application, the parser checks the document for well-formedness and (optionally) validity. If it finds an error, it informs the client application.
InputStreams and Readers
It's time to reverse the examples of Chapters 3 and 4. Instead of putting information into an XML document, I'm going to take information out of one. In particular, I'm going to use an example that reads the response from the Fibonacci XML-RPC servlet introduced in Chapter 3. This document takes the form shown in Example 5.1.
Example 5.1 A Response from the Fibonacci XML-RPC Server
<?xml version="1.0"?>
<methodResponse>
  <params>
    <param>
      <value><double>28657</double></value>
    </param>
  </params>
</methodResponse>
The clients for the XML-RPC server developed in Chapter 3 simply printed the entire document on the console. Now I want to extract just the answer and strip out all of the markup. In this situation, the user interface will look something like this:
C:\XMLJAVA>java FibonacciClient 9
34
From the user's perspective, the XML is completely hidden. The user neither knows nor cares that the request is being sent and the response is being received in an XML document. Those are merely implementation details. In fact, the user may not even know that the request is being sent over the network rather than being processed locally. All the user sees is the very basic command line interface. Obviously you could attach a fancier GUI front end, but this is not a book about GUI programming, so I'll leave that as an exercise for the reader.
Given that you're writing a client to talk to an XML-RPC server, you know that the documents you're processing always take this form. You know that the root element is methodResponse. You know that the methodResponse element contains a single params element that in turn contains a param element. You know that this param element contains a single value element. (For the moment, I'm going to ignore the possibility of a fault response to keep the examples smaller and simpler. Adding a fault response would be straightforward, and we'll do that in later chapters.) The XML-RPC specification specifies all of this. If any of it is violated in the response you get back from the server, then that server is not sending correct XML-RPC. You'd probably respond to this by throwing an exception.
Given that you're writing a client to talk to the specific servlet at http://www.elharo.com/fibonacci/XML-RPC, you know that the value element contains a single double element that in turn contains a string representing a double. This isn't true for all XML-RPC servers, but it is true for this one. If the server returned a value with a type other than double, you'd probably respond by throwing an exception, just as you would if a local method you expected to return a Double instead returned a String. The only significant difference is that in the XML-RPC case, neither the compiler nor the virtual machine can do any type checking. Thus you may want to be a bit more explicit about handling a case in which something unexpected is returned.
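One way to be explicit about an unexpected type is to check for the expected element before converting its content, and to throw a descriptive exception otherwise. The following is only a minimal sketch: the class name ResponseTypeCheck and the helper parseDoubleValue() are hypothetical and are not part of the book's examples, and the check uses the same naive string matching as the rest of this section.

```java
import java.math.BigInteger;

public class ResponseTypeCheck {

  // Hypothetical helper: takes the content of a <value> element and
  // insists that it is exactly one <double> element.
  static BigInteger parseDoubleValue(String valueContent) {
    String open = "<double>";
    String close = "</double>";
    String trimmed = valueContent.trim();
    if (!trimmed.startsWith(open) || !trimmed.endsWith(close)) {
      // Fail loudly instead of guessing what the server meant
      throw new IllegalArgumentException(
       "Expected a double value but received: " + trimmed);
    }
    String text = trimmed.substring(
     open.length(), trimmed.length() - close.length());
    // NumberFormatException is thrown if the text is not an integer
    return new BigInteger(text.trim());
  }

  public static void main(String[] args) {
    System.out.println(parseDoubleValue("<double>28657</double>"));
    try {
      parseDoubleValue("<string>hello</string>");
    }
    catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }

}
```

The point of the sketch is only that the failure is reported at the moment the assumption is violated, with a message that names the problem, rather than surfacing later as a confusing error somewhere else.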
The main point is this: Most programs you write are going to read documents written in a specific XML vocabulary. They will not be designed to handle absolutely any well-formed document that comes down the pipe. Your programs will make assumptions about the content and structure of those documents, just as they now make assumptions about the content and structure of external objects. If you are concerned that your assumptions may occasionally be violated (and you should be), then you can validate your documents against a schema of some kind so you know up front if you're being fed bad data. However, you do need to make some assumptions about the format of your documents before you can process them reasonably.
It's simple enough to hook up an InputStream and/or an InputStreamReader to the document, and read it out. For example, the following method reads an input XML document from the specified input stream and copies it to System.out:
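public static void printXML(InputStream xml) throws IOException {
  int c;
  // Copy the document one byte at a time until end of stream
  while ((c = xml.read()) != -1) System.out.write(c);
  System.out.flush();
}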
To actually extract the information, a little more work is required. You need to determine which pieces of the input you actually want and separate those out from all the rest of the text. In the Fibonacci XML-RPC example, you need to extract the text string between the <double> and </double> tags and then convert it to a java.math.BigInteger object. (Remember, I'm using a double here only because XML-RPC's ints aren't big enough to handle Fibonacci numbers. However, all the responses should contain an integral value.)
The readFibonacciXMLRPCResponse() method in Example 5.2 does exactly this by first reading the entire XML document into a StringBuffer, converting the buffer to a String, and then using the indexOf() and substring() methods to extract the desired information. The main() method connects to the server using the URL and URLConnection classes, sends a request document to the server using the OutputStream and OutputStreamWriter classes, and passes the InputStream containing the response XML document to the readFibonacciXMLRPCResponse() method.
Example 5.2 Reading an XML-RPC Response
import java.net.*;
import java.io.*;
import java.math.BigInteger;

public class FibonacciClient {

  static String defaultServer
   = "http://www.elharo.com/fibonacci/XML-RPC";

  public static void main(String[] args) {

    if (args.length <= 0) {
      System.out.println(
       "Usage: java FibonacciClient number url"
      );
      return;
    }

    String server = defaultServer;
    if (args.length >= 2) server = args[1];

    try {
      // Connect to the server
      URL u = new URL(server);
      URLConnection uc = u.openConnection();
      HttpURLConnection connection = (HttpURLConnection) uc;
      connection.setDoOutput(true);
      connection.setDoInput(true);
      connection.setRequestMethod("POST");
      OutputStream out = connection.getOutputStream();
      Writer wout = new OutputStreamWriter(out);

      // Write the request
      wout.write("<?xml version=\"1.0\"?>\r\n");
      wout.write("<methodCall>\r\n");
      wout.write(
       "  <methodName>calculateFibonacci</methodName>\r\n");
      wout.write("  <params>\r\n");
      wout.write("    <param>\r\n");
      wout.write("      <value><int>" + args[0]
       + "</int></value>\r\n");
      wout.write("    </param>\r\n");
      wout.write("  </params>\r\n");
      wout.write("</methodCall>\r\n");

      wout.flush();
      wout.close();

      // Read the response
      InputStream in = connection.getInputStream();
      BigInteger result = readFibonacciXMLRPCResponse(in);
      System.out.println(result);
      in.close();
      connection.disconnect();
    }
    catch (IOException e) {
      System.err.println(e);
    }

  }

  private static BigInteger readFibonacciXMLRPCResponse(
   InputStream in) throws IOException, NumberFormatException,
   StringIndexOutOfBoundsException {

    StringBuffer sb = new StringBuffer();
    Reader reader = new InputStreamReader(in, "UTF-8");
    int c;
    // Read from the Reader, not the raw stream, so that the bytes
    // are actually decoded as UTF-8
    while ((c = reader.read()) != -1) sb.append((char) c);

    String document = sb.toString();
    String startTag = "<value><double>";
    String endTag = "</double></value>";
    int start = document.indexOf(startTag) + startTag.length();
    int end = document.indexOf(endTag);
    String result = document.substring(start, end);
    return new BigInteger(result);

  }

}
Reading the response XML document is more work than writing the request document, but still plausible. This stream- and string-based solution is far from robust, however, and will fail if any one of the following conditions is present:
The document returned is encoded in UTF-16 instead of UTF-8.
An earlier part of the document contains the text “<value><double>”, even in a comment.
The response is written with line breaks between the value and double tags, like this:
<value>
  <double>28657</double>
</value>
There's extra white space inside the double tags, like this:
<double >28657</double >
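These failure modes are easy to reproduce without a network connection. The following sketch applies the same indexOf() and substring() logic as readFibonacciXMLRPCResponse() in Example 5.2 to a few in-memory strings. The class name NaiveExtractionDemo and the helper methods are hypothetical, introduced only for illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

public class NaiveExtractionDemo {

  // Same string matching as Example 5.2, applied to a String
  static BigInteger extract(String document) {
    String startTag = "<value><double>";
    String endTag = "</double></value>";
    int start = document.indexOf(startTag) + startTag.length();
    int end = document.indexOf(endTag);
    return new BigInteger(document.substring(start, end));
  }

  // Returns the extracted number, or the name of the exception thrown
  static String tryExtract(String document) {
    try {
      return extract(document).toString();
    }
    catch (RuntimeException e) {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    String compact = "<value><double>28657</double></value>";
    String lineBreaks = "<value>\n  <double>28657</double>\n</value>";
    String extraSpace = "<value><double >28657</double ></value>";
    String comment = "<!-- <value><double> -->" + compact;
    // A UTF-16 document whose bytes are wrongly decoded as UTF-8
    String wrongEncoding = new String(
     compact.getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_8);

    System.out.println("compact:        " + tryExtract(compact));
    System.out.println("line breaks:    " + tryExtract(lineBreaks));
    System.out.println("extra space:    " + tryExtract(extraSpace));
    System.out.println("comment:        " + tryExtract(comment));
    System.out.println("wrong encoding: " + tryExtract(wrongEncoding));
  }

}
```

Only the first input yields a number. Every other variant is an equally legitimate way to deliver the same value, yet each one ends in an unchecked exception rather than a result.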
Perhaps worse than these potential pitfalls are all the malformed responses FibonacciClient will accept, even though it should recognize and reject them. And this is a simple example in which we just want one piece of data that's clearly marked up. The more data you want from an XML document, and the more complex and flexible the markup, the harder it is to find using basic string matching or even the regular expressions introduced in Java 1.4.
Straight text parsing is not the appropriate tool with which to navigate an XML document. The structure and semantics of an XML document are encoded in the document's markup, its tags, and its attributes; and you need a tool that is designed to recognize and understand this structure as well as reporting any possible errors in this structure. The tool you need is called an XML parser.
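To give a rough idea of what this looks like in practice, here is a minimal sketch that reads the same response with the DOM parser bundled with the JDK (the javax.xml.parsers and org.w3c.dom packages). It is one possible approach under stated assumptions, not the client developed in this book: the class name ParserSketch and the method readResponse() are hypothetical, and the sketch assumes a response that contains exactly one double element.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ParserSketch {

  static BigInteger readResponse(InputStream in) throws Exception {
    DocumentBuilder builder
     = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // parse() detects the encoding itself and throws a SAXException
    // if the document is not well-formed
    Document doc = builder.parse(in);
    NodeList doubles = doc.getElementsByTagName("double");
    if (doubles.getLength() != 1) {
      throw new IllegalStateException(
       "Expected exactly one double element but found "
       + doubles.getLength());
    }
    String text = doubles.item(0).getTextContent().trim();
    return new BigInteger(text);
  }

  // Convenience overload for trying the sketch on in-memory strings
  static BigInteger readResponse(String xml) throws Exception {
    return readResponse(
     new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    String response = "<?xml version=\"1.0\"?>\n"
     + "<methodResponse><params><param>\n"
     + "  <value>\n"
     + "    <double >28657</double >\n"
     + "  </value>\n"
     + "</param></params></methodResponse>";
    System.out.println(readResponse(response));
  }

}
```

Because the parser works on elements rather than on character sequences, line breaks between the tags and white space inside them no longer matter, and text inside a comment is never mistaken for an element.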