1.2 XML and digital data
XML is a markup language. Or, rather, it is a way of creating markup languages. What this means, in the terminology of the previous section, is that it is a data model with a standardized notation for serialization. With XML, the notation is what often receives the most attention, and many think that the notation actually is XML. This is partly because this is the only visible form XML data has, which makes it appear more real to many people than the conceptual data model.
As you will see in this book, however, it is the data model that is the most important part, and the syntax is just a method for storing XML and moving it from one place to another. There could perfectly well be more than one XML syntax reflecting the same data model. In any case, the important step when representing data as XML is to express it in terms of the XML data model. So if we want to represent the email we saw earlier, we must model its structure using elements and attributes. If we want, we can then serialize that structure as an XML file.
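As a small illustration of the difference between syntax and data model, the following Python sketch (standard library only, Python 3.8 or later) compares two serializations that differ in quoting, attribute order, and empty-element notation. These are lexical variants within the one standard XML syntax, not a separate syntax. The sample documents, the `id` and `lang` attributes, and the `same_model` helper are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ in quote style, attribute order and
# empty-element notation (hypothetical sample data).
VARIANT_A = "<email id='1' lang='en'><header/></email>"
VARIANT_B = '<email lang="en" id="1"><header></header></email>'


def same_model(xml_a, xml_b):
    """Compare two documents by their canonical (C14N 2.0) serialization.

    Canonicalization removes purely lexical differences, so two texts that
    describe the same elements and attributes compare equal.
    """
    return ET.canonicalize(xml_a) == ET.canonicalize(xml_b)


identical_structure = same_model(VARIANT_A, VARIANT_B)
```

The two strings are different as text, yet they describe the same elements and attributes, which is the sense in which the data model, not the notation, carries the information.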
One way to represent emails as XML is to use an element type to represent header fields (which we might call header, with further name and value element types for the header name and value) and then another element type for each attachment (which we might call attachment). The result might look like Example 1-6.
Example 1-6. An email in XML syntax
<email>
  <header>
    <name>To</name>
    <value>Lars Marius Garshol &lt;larsga@garshol.priv.no&gt;</value>
  </header>
  <header>
    <name>Subject</name>
    <value>A funny picture</value>
  </header>
  <header>
    <name>Message-ID</name>
    <value>&lt;50325BA28B0934821A57805FB7C@mail.public.com&gt;</value>
  </header>
  <header>
    <name>Date</name>
    <value>Fri, 8 Oct 1999 11:26:22 +0200</value>
  </header>
  <header>
    <name>MIME-Version</name>
    <value>1.0</value>
  </header>
  <header>
    <name>X-Mailer</name>
    <value>Internet Mail Service (5.5.2448.0)</value>
  </header>
  <header>
    <name>Content-Type</name>
    <value>multipart/mixed</value>
  </header>
  <header>
    <name>X-UIDL</name>
    <value>37ef28060000035b</value>
  </header>
  <attachment>
    <header>
      <name>Content-type</name>
      <value>text/plain</value>
    </header>
    Hi Lars, here is a funny picture.
  </attachment>
  <attachment>
    <header>
      <name>Content-type</name>
      <value>image/gif; name="funny.gif"</value>
    </header>
    <header>
      <name>Content-transfer-encoding</name>
      <value>base64</value>
    </header>
    <header>
      <name>Content-disposition</name>
      <value>attachment; filename="funny.gif"</value>
    </header>
    ...
  </attachment>
</email>
Clearly, this is exactly the same information as in the plain text notation, but expressed in a different notation, and using a formalized data model. This is just one of many possible translations into XML that could be used. For example, we might very well have used dedicated element types to represent some of the more important header fields, such as To and From.
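To make the generic header/name/value modeling concrete, here is a minimal Python sketch that parses a shortened version of the email and collects the top-level header fields into a dictionary. It uses the standard library's ElementTree module rather than anything prescribed by this book, and the `top_level_headers` helper is a name invented for the example.

```python
import xml.etree.ElementTree as ET

# A shortened version of the email document (note the escaped "<" and ">").
EMAIL_XML = """\
<email>
  <header>
    <name>To</name>
    <value>Lars Marius Garshol &lt;larsga@garshol.priv.no&gt;</value>
  </header>
  <header>
    <name>Subject</name>
    <value>A funny picture</value>
  </header>
  <attachment>
    <header>
      <name>Content-type</name>
      <value>text/plain</value>
    </header>
    Hi Lars, here is a funny picture.
  </attachment>
</email>
"""


def top_level_headers(root):
    """Map header names to values for the headers directly under <email>."""
    return {
        header.findtext("name"): header.findtext("value")
        for header in root.findall("header")  # direct children only
    }


root = ET.fromstring(EMAIL_XML)
headers = top_level_headers(root)
attachment_count = len(root.findall("attachment"))
```

Had we chosen dedicated element types such as To and From instead, the lookup code would change, but the information extracted would be the same.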
One noteworthy aspect of the XML version of the data is that we have decided to keep the original notation of the individual values. Many of the values have an internal structure that might well have been captured in XML, but this was not done, in order to keep the complexity of the example down. The base64 encoding of the GIF image was also kept. It is convenient for XML because binary data may contain byte sequences that are illegal according to XML's rules, whereas base64 represents the data using only characters that have no special meaning in XML and can thus safely be used.
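The point about base64 can be checked with a short Python sketch. It encodes every possible byte value, including those that could never appear directly in XML text, and confirms that the result contains none of XML's markup characters. The helper name `to_xml_safe_text` is made up for this illustration.

```python
import base64

# Characters with special meaning in XML markup.
XML_SPECIAL = set("<>&'\"")


def to_xml_safe_text(data):
    """Encode arbitrary bytes as base64 text suitable for element content."""
    return base64.b64encode(data).decode("ascii")


# Every byte value from 0 to 255, standing in for arbitrary binary data.
raw = bytes(range(256))
encoded = to_xml_safe_text(raw)

contains_markup_chars = bool(XML_SPECIAL & set(encoded))
round_tripped = base64.b64decode(encoded)
```

The base64 alphabet consists of letters, digits, `+`, `/`, and `=` for padding, so the encoded text can be placed inside an element without any escaping.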
This document (complete pieces of XML data are called documents) has a corresponding conceptual structure as dictated by the XML data model. This structure is what mathematicians would describe as a tree, which means that it consists of pieces called nodes, each having exactly one parent (except the root, which has none) and any number of children. Another way to describe it is to say that it is strictly hierarchical. The data model is described in more detail in 2.4.4, "Drawing the line," on page 74. Figure 1-2 shows the structure of our XML email document in terms of the data model.
Figure 1-2 The conceptual document structure
What Figure 1-2 shows is the true structure of the XML document; the syntax is just its serialized form. In a way, one could say that this structure is what we meant, or had in mind, when we wrote the email XML document.
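The tree shape can also be seen programmatically. The sketch below, again using Python's ElementTree as an assumed tool, prints an indented outline of a shortened email document and records each element's parent. ElementTree represents only elements as objects (text is stored on the elements), so this shows the element nodes of the tree, not every node in the figure. The names `outline` and `parent_of` are invented for the example.

```python
import xml.etree.ElementTree as ET

EMAIL_XML = (
    "<email>"
    "<header><name>Subject</name><value>A funny picture</value></header>"
    "<attachment>"
    "<header><name>Content-type</name><value>text/plain</value></header>"
    "Hi Lars, here is a funny picture."
    "</attachment>"
    "</email>"
)

root = ET.fromstring(EMAIL_XML)

# ElementTree elements do not store their parent, so record it while walking.
parent_of = {child: parent for parent in root.iter() for child in parent}


def outline(element, depth=0):
    """Return the element hierarchy as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_outline = outline(root)
print("\n".join(tree_outline))
```

Every element except the root appears exactly once as a key in `parent_of`, which is the "one parent, any number of children" property that makes the structure a tree.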
The diagram contains one node whose significance may not be immediately obvious. This is the "Document" node, which represents the entire XML document. Most systems that represent XML documents have something that is equivalent to this node. This is because it is convenient to have something that contains the entire document, where DTD information and information about what was before and after the document element can be stored. It is possible to make do without this node, however, and some systems do.
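As one concrete case, Python's xml.dom.minidom (a DOM implementation chosen here purely for illustration) has such a node. In the sketch below, the document type declaration and the comments before and after the document element all end up as children of the document node. The sample document is hypothetical.

```python
from xml.dom import minidom, Node

# A tiny document with material before and after the document element.
SOURCE = (
    "<!DOCTYPE email>"
    "<!-- before the document element -->"
    "<email><header/></email>"
    "<!-- after the document element -->"
)

doc = minidom.parseString(SOURCE)

# The document node is the parent of the document element ...
document_element = doc.documentElement

# ... and also holds whatever surrounds it.
top_level = [node.nodeType for node in doc.childNodes]
doctype_name = doc.doctype.name
```

Here `doc` is the "Document" node, and `doc.documentElement` is the email element beneath it.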
In addition to the syntax and the data model, there is a standardized API called the Document Object Model (the DOM), which defines one way of representing this structure using objects in a programming language. This means that the representation of XML documents inside programs has also been standardized. DOM implementations can read in XML documents and create the corresponding structure, and also write this structure back out in serialized form. The DOM is described in detail in Chapter 11, "DOM: an introduction," on page 396.
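A minimal sketch of this read-and-write cycle, assuming Python's xml.dom.minidom as the DOM implementation, might look as follows. The calls used (`getElementsByTagName`, `firstChild`, `data`) are standard DOM names; `toxml` is minidom's own serialization method, since serialization is not part of the core DOM interfaces. The sample document is a shortened version of the email.

```python
from xml.dom import minidom

SOURCE = (
    "<email>"
    "<header><name>Subject</name><value>A funny picture</value></header>"
    "</email>"
)

# Read the serialized form and build the object structure.
doc = minidom.parseString(SOURCE)

# Navigate the structure through DOM calls.
names = [node.firstChild.data for node in doc.getElementsByTagName("name")]
values = [node.firstChild.data for node in doc.getElementsByTagName("value")]

# Write the structure back out in serialized form.
serialized = doc.documentElement.toxml()

# Parsing the output again yields the same serialization.
reparsed = minidom.parseString(serialized).documentElement.toxml()
```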
An application that wants to work with XML emails can use the DOM to access the contents of XML emails. However, that is much more awkward than using the Email class since it requires the application to work in terms of elements and attributes, rather than header fields and values. Because of this, it may be better to use the DOM to create an Email object, and then let the applications use that object instead.
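The following sketch shows what such a conversion could look like in Python with xml.dom.minidom. The `Email` and `Attachment` classes here are simplified, hypothetical stand-ins, not the Email class from the earlier section, and the helper functions are likewise invented for the example.

```python
from dataclasses import dataclass
from xml.dom import minidom, Node


@dataclass
class Attachment:  # hypothetical stand-in class
    headers: dict
    body: str


@dataclass
class Email:  # hypothetical stand-in class
    headers: dict
    attachments: list


def _child_elements(node, tag):
    """Return the element children of node with the given tag name."""
    return [c for c in node.childNodes
            if c.nodeType == Node.ELEMENT_NODE and c.tagName == tag]


def _text(node):
    """Concatenate the text nodes directly under node."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == Node.TEXT_NODE)


def _headers(node):
    """Collect the <header> children of node as a name-to-value mapping."""
    return {
        _text(_child_elements(h, "name")[0]): _text(_child_elements(h, "value")[0])
        for h in _child_elements(node, "header")
    }


def email_from_dom(doc):
    """Build an Email object from a DOM document using the email markup."""
    root = doc.documentElement
    attachments = [Attachment(_headers(a), _text(a).strip())
                   for a in _child_elements(root, "attachment")]
    return Email(_headers(root), attachments)


SAMPLE = """<email>
  <header><name>Subject</name><value>A funny picture</value></header>
  <attachment>
    <header><name>Content-type</name><value>text/plain</value></header>
    Hi Lars, here is a funny picture.
  </attachment>
</email>"""

email = email_from_dom(minidom.parseString(SAMPLE))
```

After the conversion, application code can work with `email.headers["Subject"]` and `email.attachments` and never needs to touch elements or text nodes.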