The Basics of XML
If you know HTML already, then you're familiar with the idea of tagging content. Tags are interspersed with data to represent "metadata" or data about the data. Let's start with the following sentence:
Homer's Odyssey is a revered relic of the ancient world.
Imagine you never heard of the Odyssey or Homer. I'll reprint the sentence like this:
Homer's _Odyssey_ is a revered relic of the ancient world.
I've added metadata that adds meaning to the sentence. Just by adding one underline, I've loaded the sentence with extra meaning. In HTML, this sentence would be marked up like this:
Homer's <u>Odyssey</u> is a revered relic of the ancient world.
This markup indicates that the word "Odyssey" is to appear underlined. As described in the last section, HTML is really good only at describing layout; it is a display-oriented markup. If you're interested only in how users are viewing your sentences, that's great. However, if you want to make your documents part of a system, so that they can be managed intelligently and the content within them can be searched, sorted, filed, and repurposed to meet your business needs, you need to know more about them. A human can read the sentence and logically infer that the word "Odyssey" is a book title because of the underline. The sentence contains metadata (that is, the underline), but it's ambiguous to a computer and decodable only by the human reader. Why? Because computers are stupid! If you want a computer to know that "Odyssey" is a book title, you have to be much more explicit; this is where XML comes in. XML markup for the preceding sentence might be the following:
Homer's <book>Odyssey</book> is a revered relic of the ancient world.
Aha! Now we're getting somewhere. The document is marked up using a new tag, <book>, which I've made up just for this application, to indicate where book titles are referenced. This provides two important and powerful tools: You can centrally control the style of your documents, and you have machine-readable metadata; that is, a computer can easily examine your document and tell you where the references to book titles are. You can then choose to style the occurrences of book titles however you want: with underlines, in italics, in bold, with quotes around them, in a different color, whatever.
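To make "machine-readable" concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. It is only an illustration under a few assumptions: the <p> wrapper element is something I've added because an XML parser needs a single root element, and the function names find_book_titles and render are made up for this example. The sketch also handles only one level of nesting, which is all the sample sentence needs.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumption: the sentence is wrapped in a made-up <p> root element,
# because an XML parser requires exactly one root.
SENTENCE = (
    "<p>Homer's <book>Odyssey</book> is a revered relic "
    "of the ancient world.</p>"
)


def find_book_titles(xml_text):
    """Return the text of every <book> element in the document."""
    root = ET.fromstring(xml_text)
    return [book.text or "" for book in root.iter("book")]


def render(xml_text, open_tag="<u>", close_tag="</u>"):
    """Produce display markup, wrapping each book title in the chosen style.

    Only direct children of the root are handled (no deeper nesting).
    """
    root = ET.fromstring(xml_text)
    parts = [escape(root.text or "", quote=False)]
    for child in root:
        text = escape(child.text or "", quote=False)
        if child.tag == "book":
            text = f"{open_tag}{text}{close_tag}"
        parts.append(text)
        # The "tail" is the text that follows the child's closing tag.
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)


titles = find_book_titles(SENTENCE)          # where the book titles are
underlined = render(SENTENCE)                # default style: underline
italic = render(SENTENCE, "<i>", "</i>")     # same document, new style
```

The point of the second call to render is that the style decision lives in one place. The document itself never changes; only the rule for displaying <book> does.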
Let's say you want every book title you mention to be a hyperlink to a page that enables you to buy the book. The HTML markup would look something like this:
Homer's <u><a href="http://some.store.com/buybook.cgi?ISBN=0987-2343">Odyssey</a></u> is a revered relic of the ancient world.
In this example, you've hard-coded the document with a specific Uniform Resource Locator (URL) to a script on some online bookstore somewhere. What if that bookstore goes out of business? What if you make a strategic partnership with some other online bookstore and you want to change all the book titles to point to that store's pages? Then you've got to go through all of your documents with some kind of half-baked Perl script. What if your documents aren't all coded consistently? There are about a hundred things that can and will go wrong in this scenario. Believe me, I've been there.
Let's look at XML markup of the same sentence:
Homer's <book isbn="0987-2343">Odyssey</book> is a revered relic of the ancient world.
Now isn't that a breath of fresh air? By replacing the hard-coded script reference with a simple indication of ISBN (International Standard Book Number, a guaranteed unique number for every book printed¹), you've cut the complexity of markup in half. In addition, you have enabled centralized control over whether book titles should be links and, if so, where they link. Assuming central control of how XML documents are turned into display-oriented markup, you can make a change in this one place to affect the display of many documents. As a special bonus, if you store all your XML documents in a database and properly decompose, or extract, the information within them (as we'll discuss next), you can also find out which book titles are referred to from which documents.
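Here is one way that central control might look, sketched in Python with the standard xml.etree.ElementTree module. Treat it as an illustration, not a prescription: the <p> wrapper, the BOOK_LINK_TEMPLATE setting, and the function names are all hypothetical, and a plain dictionary stands in for the database. The store URL is the one from the hard-coded HTML example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical central setting: change it once and every document
# renders with the new link target. None means "no links at all".
BOOK_LINK_TEMPLATE = "http://some.store.com/buybook.cgi?ISBN={isbn}"

# Assumption: a made-up <p> root wraps the sentence so it parses as XML.
SENTENCE = (
    '<p>Homer\'s <book isbn="0987-2343">Odyssey</book> is a revered '
    "relic of the ancient world.</p>"
)


def render_book(book, link_template):
    """Render one <book> element as underlined, optionally linked, HTML."""
    title = escape(book.text or "", quote=False)
    isbn = book.get("isbn")
    if link_template and isbn:
        url = escape(link_template.format(isbn=isbn), quote=True)
        return f'<u><a href="{url}">{title}</a></u>'
    return f"<u>{title}</u>"


def render(xml_text, link_template=BOOK_LINK_TEMPLATE):
    """Turn a document into display markup (one level of nesting only)."""
    root = ET.fromstring(xml_text)
    parts = [escape(root.text or "", quote=False)]
    for child in root:
        if child.tag == "book":
            parts.append(render_book(child, link_template))
        else:
            parts.append(escape(child.text or "", quote=False))
        parts.append(escape(child.tail or "", quote=False))
    return "".join(parts)


def books_by_document(documents):
    """Map each ISBN to the sorted names of the documents that mention it."""
    index = {}
    for name, xml_text in documents.items():
        for book in ET.fromstring(xml_text).iter("book"):
            isbn = book.get("isbn")
            if isbn:
                index.setdefault(isbn, set()).add(name)
    return {isbn: sorted(names) for isbn, names in index.items()}


linked = render(SENTENCE)
unlinked = render(SENTENCE, link_template=None)
index = books_by_document({"essay.xml": SENTENCE})  # hypothetical file name
```

If the bookstore goes out of business, the fix is a one-line change to BOOK_LINK_TEMPLATE rather than a pass over every document, and books_by_document answers the "which documents mention which books" question without any guessing about underlines.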