Converting Flat Files to XML
Relatively little of the world's data is currently stored in XML. Much of it is stored in flat files as tab-delimited text, comma-separated values, or some similar format. More is locked up in databases of one kind or another, whether relational, hierarchical, or object based. Even more is hidden inside unstructured documents, including Microsoft Word files, HTML documents, and plain text. XML tools are not suitable for working with any of this.
There are no magic bullets that will convert all of your data to semantically tagged XML. There are a few specialized programs that convert certain formats such as Word documents to particular XML applications such as XHTML. However, the output from even the best of these tools often needs to be cleaned up by hand. How much clean-up work you need to do generally depends on how structured the data format is to start with and how clean the data is. It's relatively easy to encode a relational table from a DB2 database as XML because it already has a lot of structure and a mandatory schema. It's a lot harder in practice to convert tab-delimited text files because those tend to be full of mistakes and dirty data. Records are missing fields. Fields get swapped with each other. A field that is supposed to contain a number between 1 and 12 may contain a list of foodstuffs the data entry clerk was supposed to buy on his way home one day. All of these things can and do happen, and you have to account for them regardless of what you're doing with such data, whether that's converting it to XML or summarizing it for an annual report.
When you're tasked with converting legacy data to XML, you just have to roll up your sleeves and attack the problem. You need to understand the current structure of the data. You need to write a program that reads the input format and writes out XML. You need to debug the inevitable problems that arise when the data in the input isn't exactly you thought it was or what it was supposed to be. By far the hardest part of this problem is parsing the input data, in whatever form it takes. Once you've loaded the data into your program, writing it back out again in XML is a cakewalk.
The Budget
As an example of this process, I'm going to use U.S. federal government budget authorization data, which the Office of Management and Budget (OMB) publishes in a variety of equivalent flat formats, even though the data itself is relatively unflat. This is a good example of the sort of legacy data developers often have to deal with. The complete document [http://w3.access.gpo.gov/usbudget/fy2002/db.html] consists of 3,185 line items. Each line item consists of 43 separate fields. In the comma-separated values (CSV) version of the file, a typical line item looks like this:
"418","National Endowment for the Humanities","00","National Endow- ment for the Humanities","0200","National Endowment for the Humani- ties: grants and administration","59","503","Research and general education aids","Discretionary","On-budget", 0, 0, 0, 121275, 145231, 150100, 151299, 130560, 135447, 140118, 139478, 132582, 138890, 140435, 153000, 156910, 170002, 175955, 177413, 177491, 172000, 110000, 110000, 111000, 112000, 115000, 120000, 121000, 124000, 126000, 129000, 132000
Each field is separated from the following field by a comma. Strings are enclosed in double quotes, and may contain commas that do not delimit fields. Dollar amounts are expressed as integers divided by 1,000. That is, the last value in the above line is 132 million dollars, not 132 thousand dollars. Table 4.1 identifies the 43 separate fields.
Table 4.1. Public Budget Database Field Descriptions
Field Number | Field Name | Description |
1 | Agency code | A unique, three-digit, numerical code for the cabinet department or independent agency. |
2 | Agency name | The name of the cabinet-level department (e.g., Department of Defense) or independent agency (e.g., Peace Corps). Even though only the executive branch has true agencies, for purposes of the budget, offices within the legislative branch are given the agency name “Legislative Branch” and offices within the judicial branch are given the agency name “Judicial Branch.” Agency names have a maximum of 89 characters. |
3 | Bureau code | A two-digit numerical code for the bureau within the cabinet department or independent agency; bureau codes are unique only within an agency. |
4 | Bureau name | The name of the suboffice within the agency (e.g., Coast Guard or Federal Aviation Administration). Budget amounts for the agency as a whole that are not part of a specific bureau are generally categorized under a fictional bureau with the same name as the agency. Bureau names have a maximum of 89 characters. |
5 | Account code | A four-digit code (outlays) or six-digit code (offsetting receipts) for the account within the bureau. |
6 | Account name | The name of the budgeted function within the bureau. Account names have a maximum of 160 characters. |
7 | Treasury Agency code | A two-digit numerical code for the agency, assigned by the Treasury Department. |
8 | Subfunction code | A three-digit numerical code for the subfunction within an account. |
9 | Subfunction title | The name of the subfunction within the account. Subfunction titles have a maximum of 72 characters. |
10 | BEA category | Budget Enforcement Act category: “Mandatory,” “Discretionary,” or “Net interest.” |
11 | On- and off-budget indicator | “On-budget” or “Off-budget”; Social Security trust funds and the Postal Service are off-budget; all other accounts are on-budget. |
12 | 1976 value | Actual amounts, in thousands of dollars, for fiscal year (FY) 1976. Budget authority is usually shown as a positive value. Offsetting receipts are usually negative values. |
13 | TQ value | Actual amount, in thousands of dollars, for the “transitional quarter” in 1976 when the government shifted the start of its fiscal year ahead from July to October. |
14–37 | 1977–2000 | Actual amounts, in thousands of dollars, for each fiscal year from 1977 to 2000. |
38–end | 2001–2006 | Estimated amounts, in thousands of dollars, for FY 2001 through FY 2006. |
Source: Adapted from U.S. Office of Management and Budget. Budget Analysis Branch. Public Budget Database User's Guide: Budget of the United States Government, Fiscal Year 2002 [http://a257.gakamaitech.net/7/257/2422/09apr20010800/www.gpo.gov/usbudget/FY2002/pdf/db_guide.pdf], April 9, 2001, 8. |