Alternatives to Java
When all you have is a hammer, most problems look a lot like nails. Since you're reading this book, I'm willing to bet that Java is your hammer of choice, and indeed Java is a very powerful hammer. But sometimes you really could use a screwdriver, and this may be one of those times. I must admit that the solution for imposing hierarchy developed in the last section feels more than a little like pounding a screw with a hammer. Maybe it would be better to use the hammer to set the screw, but then use a screwdriver to drive it in. In this section I want to explore a few possible screwdrivers, including XSLT and XQuery.

Rather than using such complex Java code, I'll do the following: First I'll use Java to get the data into the same simple XML format as that produced by Example 4.2, which closely matches the flat input data. Then I'll use XSLT to transform this simple intermediate XML format into the less flat final XML format.

To refresh your memory, the flat XML data is organized like this:
<?xml version="1.0"?>
<Budget>
  <LineItem>
    <FY1994>-1982</FY1994>
    <FY1993>4946</FY1993>
    <FY1992>-3251</FY1992>
    <FY1991>-17373</FY1991>
    <FY1990>-90008</FY1990>
    <AccountCode>265197</AccountCode>
    <On-Off-BudgetIndicator>On-budget</On-Off-BudgetIndicator>
    <TransitionQuarter>0</TransitionQuarter>
    <FY1989>-80069</FY1989>
    <AccountName>Sale of scrap and salvage materials</AccountName>
    <FY1988>-72411</FY1988>
    <FY1987>-60964</FY1987>
    <FY1986>-61462</FY1986>
    <FY1985>-68182</FY1985>
    <FY1984>-79482</FY1984>
    <FY1983>0</FY1983>
    <FY1982>0</FY1982>
    <SubfunctionCode>051</SubfunctionCode>
    <FY1981>0</FY1981>
    <FY2006>-1000</FY2006>
    <FY1980>0</FY1980>
    <FY2005>-1000</FY2005>
    <FY2004>-1000</FY2004>
    <FY2003>-1000</FY2003>
    <FY2002>-1000</FY2002>
    <FY2001>-1000</FY2001>
    <FY2000>-2000</FY2000>
    <AgencyCode>007</AgencyCode>
    <BEACategory>Mandatory</BEACategory>
    <FY1979>0</FY1979>
    <FY1978>0</FY1978>
    <FY1977>0</FY1977>
    <FY1976>0</FY1976>
    <TreasuryAgencyCode>97</TreasuryAgencyCode>
    <AgencyName>Department of Defense--Military</AgencyName>
    <BureauCode>00</BureauCode>
    <BureauName>Department of Defense--Military</BureauName>
    <FY1999>-1000</FY1999>
    <FY1998>-2000</FY1998>
    <FY1997>-4000</FY1997>
    <FY1996>-1000</FY1996>
    <SubfunctionTitle>Department of Defense-Military </SubfunctionTitle>
    <FY1995>-1000</FY1995>
  </LineItem>
  <!-- several thousand more LineItem elements... -->
</Budget>
Imposing Hierarchy with XSLT
The XSLT stylesheet shown in Example 4.12 will convert flat XML budget data of this type into an output document of the same form produced by Example 4.11. Because the input file is so large, you may need to raise the memory allocation for your XSLT processor before running the transform. For a Java-based processor, that means increasing the JVM's maximum heap size with the -Xmx option.
Example 4.12 An XSLT Stylesheet That Converts Flat XML Data to Hierarchical XML Data
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Try to make the output look half decent -->
  <xsl:output indent="yes" encoding="ISO-8859-1"/>

  <!-- Muenchian method: one key per level of the output hierarchy -->
  <xsl:key name="agencies" match="LineItem" use="AgencyCode"/>
  <xsl:key name="bureaus" match="LineItem"
    use="concat(AgencyCode,'+',BureauCode)"/>
  <xsl:key name="accounts" match="LineItem"
    use="concat(AgencyCode,'+',BureauCode,'+',AccountCode)"/>
  <xsl:key name="subfunctions" match="LineItem"
    use="concat(AgencyCode,'+',BureauCode,'+',AccountCode,
                '+',SubfunctionCode)"/>

  <xsl:template match="Budget">
    <Budget year='2001'>
      <!-- First LineItem for each distinct agency -->
      <xsl:for-each select="LineItem[generate-id()
        = generate-id(key('agencies',AgencyCode)[1])]">
        <Agency>
          <Name><xsl:value-of select="AgencyName"/></Name>
          <Code><xsl:value-of select="AgencyCode"/></Code>
          <!-- First LineItem for each distinct bureau in this agency -->
          <xsl:for-each select="/Budget/LineItem
            [AgencyCode=current()/AgencyCode]
            [generate-id() = generate-id(key('bureaus',
              concat(AgencyCode, '+', BureauCode))[1])]">
            <Bureau>
              <Name><xsl:value-of select="BureauName"/></Name>
              <Code><xsl:value-of select="BureauCode"/></Code>
              <!-- First LineItem for each distinct account
                   in this bureau -->
              <xsl:for-each select="/Budget/LineItem
                [AgencyCode=current()/AgencyCode]
                [BureauCode=current()/BureauCode]
                [generate-id() = generate-id(key('accounts',
                  concat(AgencyCode,'+',BureauCode,'+',
                         AccountCode))[1])]">
                <Account>
                  <Name><xsl:value-of select="AccountName"/></Name>
                  <Code><xsl:value-of select="AccountCode"/></Code>
                  <!-- First LineItem for each distinct subfunction
                       in this account -->
                  <xsl:for-each select="/Budget/LineItem
                    [AgencyCode=current()/AgencyCode]
                    [BureauCode=current()/BureauCode]
                    [AccountCode=current()/AccountCode]
                    [generate-id()=generate-id(
                      key('subfunctions',
                        concat(AgencyCode,'+',
                               BureauCode,'+',AccountCode,'+',
                               SubfunctionCode))[1])]">
                    <Subfunction BEACategory="{BEACategory}"
                      BudgetIndicator="{On-Off-BudgetIndicator}">
                      <Title>
                        <xsl:value-of select="SubfunctionTitle"/>
                      </Title>
                      <Code>
                        <xsl:value-of select="SubfunctionCode"/>
                      </Code>
                      <Amount>
                        <xsl:value-of select="FY2001"/>
                      </Amount>
                    </Subfunction>
                  </xsl:for-each>
                </Account>
              </xsl:for-each>
            </Bureau>
          </xsl:for-each>
        </Agency>
      </xsl:for-each>
    </Budget>
  </xsl:template>

</xsl:stylesheet>
The algorithm for converting flat data to hierarchical data with XSLT is known as the Muenchian method after its inventor, Steve Muench of Oracle. The trick of the Muenchian method is to use the xsl:key element and the key() function to create node sets of all the LineItem elements that share the same agency, bureau, account, or subfunction. Inside the template, the generate-id() function is used to compare the current node to the first node in its group. Output is generated only if the current node is indeed the first LineItem element with a given agency, bureau, account, or subfunction code. Also note that the select attributes in the xsl:for-each elements keep returning to the root rather than processing children and descendants as is customary. This reflects the fact that the hierarchy in the input is not the same as the hierarchy in the output.
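To see the first-node-in-group test in isolation, here is a minimal sketch that runs a cut-down stylesheet through the standard JAXP API (javax.xml.transform). It assumes only one level of grouping (agencies) and a three-item input. The class name MuenchianDemo, the constants, and the sample data are hypothetical illustrations and not part of Example 4.12.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class MuenchianDemo {

    // Cut-down, hypothetical stylesheet: one key, one level of grouping.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:key name='agencies' match='LineItem' use='AgencyCode'/>"
        + "<xsl:template match='Budget'>"
        + "<Budget>"
        // Keep a LineItem only if it is the first node its key returns.
        + "<xsl:for-each select=\"LineItem[generate-id() ="
        + " generate-id(key('agencies', AgencyCode)[1])]\">"
        + "<Agency>"
        + "<Name><xsl:value-of select='AgencyName'/></Name>"
        + "<Code><xsl:value-of select='AgencyCode'/></Code>"
        + "</Agency>"
        + "</xsl:for-each>"
        + "</Budget>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Tiny stand-in for the flat budget file; agency 001 appears twice.
    static final String FLAT_INPUT =
        "<Budget>"
        + lineItem("001", "Legislative Branch")
        + lineItem("007", "Department of Defense--Military")
        + lineItem("001", "Legislative Branch")
        + "</Budget>";

    // Builds one flat LineItem with just the two fields the sketch needs.
    static String lineItem(String code, String name) {
        return "<LineItem><AgencyCode>" + code + "</AgencyCode>"
            + "<AgencyName>" + name + "</AgencyName></LineItem>";
    }

    // Applies the stylesheet to the input and returns the result as text.
    static String transform(String stylesheet, String input) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(input)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(STYLESHEET, FLAT_INPUT));
    }
}
```

Although the input contains three LineItem elements, only two of them pass the generate-id() comparison, so the result should contain one Agency element per distinct agency code, in the order the codes first appear in the input.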
One minor advantage of using XSLT instead of Java data structures is that XSLT preserves the order of the input data. You'll notice that the output begins with the Legislative Branch agency, the Legislative Branch bureau, and the "Receipts, Central fiscal operations" account, just as the input data does. This was not the case for the output produced by Java.
Note
XSLT 2.0 will make it much easier to write stylesheets that group elements in this fashion. This will likely involve a new xsl:for-each-group element that groups elements according to an XPath expression, and a current-group() function that selects all members of the current group so that they can be processed together.
<?xml version="1.0" encoding="ISO-8859-1"?>
<Budget year="2001">
  <Agency>
    <Name>Legislative Branch</Name>
    <Code>001</Code>
    <Bureau>
      <Name>Legislative Branch</Name>
      <Code>00</Code>
      <Account>
        <Name>Receipts, Central fiscal operations</Name>
        <Code/>
        <Subfunction BEACategory="Mandatory"
                     BudgetIndicator="On-budget">
          <Title>Central fiscal operations</Title>
          <Code>803</Code>
          <Amount>0</Amount>
        </Subfunction>
        <Subfunction BEACategory="Net interest"
                     BudgetIndicator="On-budget">
          <Title>Other interest</Title>
          <Code>908</Code>
          <Amount>0</Amount>
        </Subfunction>
      </Account>
      <Account>
        <Name>Charges for services to trust funds</Name>
        ...
The XML Query Language
XSLT is Turing complete. Nonetheless, some operations are more than a little cumbersome in XSLT. XSLT's inventors definitely did not envision using the Muenchian method to impose hierarchy. The W3C has begun work on a language more suitable for querying XML documents, called, simply enough, the XML Query Language, or XQuery for short. XQuery is to XML documents what SQL is to relational tables. However, XQuery is limited to SELECT. It has no equivalent of INSERT, UPDATE, or DELETE. It is a read-only language.
Caution
This section describes bleeding-edge technology. The broad picture presented here is likely to be correct, but the details are almost certain to change. Furthermore, the exact subset of XQuery implemented by early experimental tools varies significantly from one product to the next.
XQuery queries are not in general well-formed XML. Although there is an XML syntax for XQuery, it is not intended to be used by human beings. Instead humans are supposed to write in a more natural 4GL syntax, which will be compiled into XML documents if necessary. If you think about it, this shouldn't be so surprising: SQL statements aren't tables. Why should XQuery statements be XML documents?
The heart of an XQuery query is the FLWR (pronounced “flower”) statement. FLWR is an acronym for for-let-where-return. In brief, for each node in a node set, let a variable have a certain value, where some condition is true, and return an XML fragment based on the values of these variables. Variables are set and XML is returned using XPath 2.0 expressions.
For example, here's an XQuery that generates a list of agency names from the flat XML budget:
for $name in document("budauth.xml")/Budget/LineItem/AgencyName
return $name
The for clause iterates over every node in the node set returned by the XPath 2.0 expression document("budauth.xml")/Budget/LineItem/AgencyName. This expression returns a node set containing 3,175 AgencyName elements. The XQuery variable $name is set to each of these elements in turn. The return clause is evaluated for each value of $name. In this case, the return clause says simply to return the node to which the $name variable currently points. In this example, the $name variable always points to an AgencyName element; therefore, the output would begin like this:
<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Legislative Branch</AgencyName>
...
This is not a well-formed XML document because it does not have a root element. However, it is a well-formed XML document fragment.
You can use the XPath 2.0 distinct-values() function around the XPath expression to select only one of each AgencyName element:
for $name in distinct-values(
    document("budauth.xml")/Budget/LineItem/AgencyName)
return $name
The output would now begin like this, listing each agency name only once:
<AgencyName>Legislative Branch</AgencyName>
<AgencyName>Judicial Branch</AgencyName>
<AgencyName>Department of Agriculture</AgencyName>
<AgencyName>Department of Commerce</AgencyName>
<AgencyName>Department of Defense--Military</AgencyName>
...
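For comparison, the same selection can be approximated in plain Java with the standard javax.xml.xpath API. This is only a rough analog under stated assumptions, not XQuery itself: the XPath 1.0 engine in the JDK has no distinct-values() function, so the sketch removes duplicates with a LinkedHashSet, and it returns the text of each name rather than AgencyName elements. The class name DistinctAgencyNames and the inline sample data are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DistinctAgencyNames {

    // Tiny stand-in for budauth.xml; the first agency appears twice.
    static final String FLAT_INPUT =
        "<Budget>"
        + "<LineItem><AgencyName>Legislative Branch</AgencyName></LineItem>"
        + "<LineItem><AgencyName>Legislative Branch</AgencyName></LineItem>"
        + "<LineItem><AgencyName>Department of Defense--Military"
        + "</AgencyName></LineItem>"
        + "</Budget>";

    // Returns each agency name once, in order of first appearance.
    static List<String> distinctAgencyNames(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The "for" part: every AgencyName element in the document.
            NodeList names = (NodeList) xpath.evaluate(
                "/Budget/LineItem/AgencyName",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            // The "distinct-values" part: a set that remembers insertion order.
            Set<String> seen = new LinkedHashSet<>();
            for (int i = 0; i < names.getLength(); i++) {
                seen.add(names.item(i).getTextContent());
            }
            return new ArrayList<>(seen);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // The "return" part: emit one line per distinct name.
        for (String name : distinctAgencyNames(FLAT_INPUT)) {
            System.out.println(name);
        }
    }
}
```

The loop and the set together take roughly a dozen lines to do what the query expresses in three, which is the point of reaching for a query language here.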
As well as copying existing elements, XQuery can create new elements. You can type the tags precisely where you want them to appear. To include the value of a variable (or other expression) inside the tags, enclose it in curly braces. For example, the following query places <Name> and </Name> tags around each agency name, rather than <AgencyName> and </AgencyName>. Notice also that it selects only the text content of each AgencyName element, rather than the complete element node:
for $name in distinct-values(
    document("budauth.xml")//AgencyName/text())
return <Name>{$name }</Name>
The output now begins like this:
<Name>Legislative Branch</Name>
<Name>Judicial Branch</Name>
<Name>Department of Agriculture</Name>
<Name>Department of Commerce</Name>
<Name>Department of Defense--Military</Name>
...
More complex queries typically require multiple variables. These can be set in a let clause based on XPath expressions that refer to the variable in the for clause. For example, this query selects distinct agency codes but returns agency names:
for $code in distinct-values(document("budauth.xml")//AgencyCode)
let $name := $code/../AgencyName
return $name
A where clause can further restrict the members of the node set for which results are generated. where conditions can use boolean connectors such as and, or, and not(). For example, this query finds all the bureaus in the Department of Agriculture:
for $bureau in distinct-values(
    document("budauth.xml")/Budget/LineItem/BureauName)
where $bureau/../AgencyName = "Department of Agriculture"
return $bureau
XQuery expressions can nest. That is, the return clause of a FLWR statement may contain another FLWR statement. For example, this statement lists all the bureau names inside their respective agencies:
for $ac in distinct-values(document("budauth.xml")//AgencyCode)
return
  <Agency>
    <Name>{$ac/../AgencyName/text() }</Name>
    {
      for $bc in distinct-values(document("budauth.xml")//BureauCode)
      where $bc/../AgencyCode = $ac
      return
        <Bureau>
          <Name>{$bc/../BureauName/text() }</Name>
        </Bureau>
    }
  </Agency>
The output now begins like this:
<Agency>
  <Name>Legislative Branch</Name>
  <Bureau>
    <Name>Legislative Branch</Name>
  </Bureau>
  <Bureau>
    <Name>Senate</Name>
  </Bureau>
  <Bureau>
    <Name>House of Representatives</Name>
  </Bureau>
  <Bureau>
    <Name>Joint Items</Name>
  </Bureau>
  ...
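The grouping that the nested query computes can also be sketched in Java, which makes the result's shape explicit: a map from each agency to the set of its bureaus. This is an illustrative analog under simplifying assumptions, not a translation of the query. It groups by agency name rather than agency code, and it uses LinkedHashMap and LinkedHashSet so that agencies and bureaus stay in input order. The class name BureausByAgency and the sample data are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BureausByAgency {

    // Tiny stand-in for budauth.xml with two agencies and four bureaus.
    static final String FLAT_INPUT =
        "<Budget>"
        + lineItem("Legislative Branch", "Legislative Branch")
        + lineItem("Legislative Branch", "Senate")
        + lineItem("Legislative Branch", "House of Representatives")
        + lineItem("Legislative Branch", "Senate")
        + lineItem("Department of Defense--Military",
                   "Department of Defense--Military")
        + "</Budget>";

    // Builds one flat LineItem with just the two fields the sketch needs.
    static String lineItem(String agency, String bureau) {
        return "<LineItem><AgencyName>" + agency + "</AgencyName>"
            + "<BureauName>" + bureau + "</BureauName></LineItem>";
    }

    // Maps each agency name to its distinct bureau names, in input order.
    static Map<String, Set<String>> group(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList items = (NodeList) xpath.evaluate(
                "/Budget/LineItem",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            Map<String, Set<String>> result = new LinkedHashMap<>();
            for (int i = 0; i < items.getLength(); i++) {
                Node item = items.item(i);
                // Relative paths are evaluated against the current LineItem.
                String agency = xpath.evaluate("AgencyName", item);
                String bureau = xpath.evaluate("BureauName", item);
                // Outer level: one entry per agency; inner level: its bureaus.
                result.computeIfAbsent(agency, k -> new LinkedHashSet<>())
                      .add(bureau);
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Set<String>> entry : group(FLAT_INPUT).entrySet()) {
            System.out.println(entry.getKey());
            for (String bureau : entry.getValue()) {
                System.out.println("  " + bureau);
            }
        }
    }
}
```

Unlike the query, this version makes a single pass over the LineItem elements and builds the hierarchy as it goes, at the cost of holding the whole map in memory before anything can be written out.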
This is all the syntax needed to write a query that will convert flat budget data such as that produced by Example 4.2 into a hierarchical XML document. Example 4.13, which selects the data from 2001, demonstrates such a query.
Example 4.13 An XQuery That Converts Flat Data to Hierarchical Data
<Budget year="2001">
  {
    for $ac in distinct-values(document("budauth.xml")//AgencyCode)
    return
      <Agency>
        <Name>{$ac/../AgencyName/text() }</Name>
        <Code>{$ac/text() }</Code>
        {
          for $bc in distinct-values(document("budauth.xml")//BureauCode)
          where $bc/../AgencyCode = $ac
          return
            <Bureau>
              <Name>{$bc/../BureauName/text() }</Name>
              <Code>{$bc/text() }</Code>
              {
                for $acct in distinct-values(
                    document("budauth.xml")//AccountCode)
                where $acct/../AgencyCode = $ac
                  and $acct/../BureauCode = $bc
                return
                  <Account
                    BEACategory="{$acct/../BEACategory/text() }">
                    <Name>{$acct/../AccountName/text() }</Name>
                    <Code>{$acct/text() }</Code>
                    {
                      for $sfx
                        in document("budauth.xml")//SubfunctionCode
                      where $sfx/../AgencyCode = $ac
                        and $sfx/../BureauCode = $bc
                        and $sfx/../AccountCode = $acct
                      return
                        <Subfunction>
                          <Title>{$sfx/../SubfunctionTitle/text()}</Title>
                          <Code>{$sfx/text() }</Code>
                          <Amount>{$sfx/../FY2001/text() }</Amount>
                        </Subfunction>
                    }
                  </Account>
              }
            </Bureau>
        }
      </Agency>
  }
</Budget>
There's a lot more to XQuery, but this should give you an idea of what it can do. It's definitely worth a look any time you need to perform database-like operations on XML documents.