1.4 XML and information systems
The first thing to realize is that the arrival of XML does not mean that all information systems that are not based on XML become obsolete all of a sudden. In fact, the reality is very much the opposite; XML and classical information systems are complementary and can be used together. Classical information systems are classical because they are extraordinarily useful, and XML will not change that. What XML is likely to change is the amount of interoperability between information systems. In some cases, it will also change what such systems can do and how they are put together.
This section examines how XML can be used with information systems, particularly classical ones, but also how it makes it possible to create new kinds of applications and uses.
1.4.1 XML in traditional information systems
Traditional information systems follow the basic anatomy outlined in Figure 1-3, with a central data store surrounded by a cluster of applications that access it. The exact form of this data store may vary with the application, and the arrival of XML has a number of consequences for the data store.
1.4.1.1 XML files
The most obvious way of basing an information system on XML is to simply use a set of XML documents, stored as files in the file system, as the central data store. This approach has been much used in document-oriented systems and is implicitly assumed by the standard interfaces of many XML tools. These tools expect to be run from the command line and to be passed file names as arguments. The main benefit of this approach is that it requires no work at all to set it up, and any developer and user can understand it.
The first consequence of this approach is that the XML documents in the file system now become the primary representation of the information in the system. The applications in the cluster around this data store will generally take one or more XML documents and produce some output from them. Very often this will be HTML or some other publishing format. Any updates to the information in the system must be made to the XML files, since all other renditions of the information are derived from these files. To have the updates reflected in the published files, one simply runs the translating applications again.
In general, every application that wishes to make use of the information in the XML files will use an XML parser to read the information into its own internal data structure (see 2.3.2, "The parser model," on page 57). This process must be repeated every time an application is started, which may be very awkward if the volume of information is large. Any application that wishes to change the information must first load the documents, then change its internal structure, and finally write the information back out in XML form so that other applications can access it.
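The load, modify, and write-back cycle described above can be sketched as follows. This is only an illustration using Python's standard `xml.etree.ElementTree` module; the document content and the `retitle` helper are hypothetical, and a real application would read from and write to files rather than strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; in the "XML files" approach this text would live in a file.
SOURCE = "<library><book id='b1'><title>Old title</title></book></library>"

def retitle(xml_text: str, book_id: str, new_title: str) -> str:
    """Parse the document, change the in-memory tree, and serialize it again."""
    root = ET.fromstring(xml_text)                    # 1. parser builds the internal structure
    for book in root.iter("book"):
        if book.get("id") == book_id:
            book.find("title").text = new_title       # 2. change the internal structure
    return ET.tostring(root, encoding="unicode")      # 3. write the result back out as XML

updated = retitle(SOURCE, "b1", "New title")
print(updated)
```

Note that every run repeats step 1 in full, which is the start-up cost the paragraph above refers to.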
When modifying the source XML documents in this way it is important to preserve all important aspects of the documents in the transformation. But just as in the email example this may be difficult, since the programs are operating on an internal representation of the XML documents rather than the external form of the documents. Since the internal representation contains less information than the original documents did, necessary information may have been lost. We will return to this problem (and the solutions to it) in more detail later.
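A small experiment illustrates this kind of loss. The sketch below assumes Python's standard `xml.etree.ElementTree` module with its default parser settings; the sample document is invented. With these defaults the in-memory tree does not keep comments or the original attribute quoting, so neither survives a round trip.

```python
import xml.etree.ElementTree as ET

# Invented sample: note the comment and the single-quoted attribute value.
original = "<note lang='en'><!-- reviewed by editor --><to>Ann</to></note>"

# Parse into the internal representation, then serialize it straight back out.
round_tripped = ET.tostring(ET.fromstring(original), encoding="unicode")

print(original)
print(round_tripped)
print("comment preserved:", "<!--" in round_tripped)
```

Whether such details count as "important aspects" depends on the application, which is why the choice of internal representation matters.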
Of course, updating shared information in this way will often be dangerous, since multiple applications may attempt to modify the same document at the same time, which can cause information to be lost or corrupted. Another problem is that although one can make a schema for the data in the form of a DTD or an XML Schema definition, nothing prevents an application (or a user with a text editor) from modifying the XML files in a way that does not conform to the schema.
1.4.1.2 XML databases
Databases were invented to solve the problems with concurrent access to large volumes of information, and provide proven solutions to these problems. This makes them highly desirable for applications that either involve concurrent access or work with large volumes of data, and in fact also for many applications that do neither.
To use databases with XML one must implement the XML data model in a database and then use this to store the tree structure of the XML documents in the database. One approach to this is to use an existing database system, whether relational, object-oriented, or something else, and implement an XML storage system on top of it. (Note that this approach confuses the information model/data model distinction somewhat, since the data model of the database is now used to implement the XML data model.) Another approach is to develop a database specifically based on the XML data model. Such databases are often called native XML databases, since the XML data model is their only data model.
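As a rough illustration of the first approach, the sketch below stores an element tree in a relational table, one row per element. It uses Python's standard `sqlite3` and `xml.etree.ElementTree` modules; the `node` table layout and the helper names are assumptions made for this example, and attributes and mixed content are deliberately left out to keep it short.

```python
import sqlite3
import xml.etree.ElementTree as ET

def store_tree(conn, root):
    """Store each element as a row that points at its parent row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node ("
        "id INTEGER PRIMARY KEY, parent INTEGER, pos INTEGER, tag TEXT, text TEXT)"
    )

    def insert(elem, parent_id, pos):
        cur = conn.execute(
            "INSERT INTO node (parent, pos, tag, text) VALUES (?, ?, ?, ?)",
            (parent_id, pos, elem.tag, (elem.text or "").strip()),
        )
        node_id = cur.lastrowid
        for i, child in enumerate(elem):      # keep document order in 'pos'
            insert(child, node_id, i)
        return node_id

    with conn:                                # commit all rows as one transaction
        return insert(root, None, 0)

def children(conn, node_id):
    """Return the tag names of a node's children in document order."""
    rows = conn.execute(
        "SELECT tag FROM node WHERE parent = ? ORDER BY pos", (node_id,)
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
doc = ET.fromstring(
    "<library><book><title>XML</title><author>A. Writer</author></book><book/></library>"
)
root_id = store_tree(conn, doc)
print(children(conn, root_id))
```

Here the relational model (rows and columns) is used purely as an implementation vehicle for the tree, which is the blurring of data models that the parenthetical note above points out.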
In both cases the solution has much in common with the "XML files" solution, the main difference being the location of the XML documents. The central data store still uses the XML data model and can also use the same kinds of schemas. When an application now wishes to use an XML document from the central data store, it will no longer load it into memory using a parser, but rather connect to the database. Once connected, it will be presented with an API that represents the XML document inside the database, and it will access the document information through this API (see 3.4, "Virtual documents," on page 92 for more information on this). This does away with the problems of large XML documents that do not fit in memory and take a long time to load, since documents are now not loaded at all and the database handles memory management transparently.
The manner in which the XML documents are updated is also changed completely, since the applications are in direct contact with a document that lives inside a database. To change a document, the application will make the change through the document API and then commit it to the database. The costly and risky operation of writing the document back out to disk is done away with; instead, the database updates its internal structure, taking care of any concurrency and data integrity issues.
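The change-then-commit pattern might look something like the following sketch. The `StoredDocument` class is a hypothetical document API invented for this example, backed by SQLite through Python's standard `sqlite3` module; real XML databases expose their own, richer interfaces.

```python
import sqlite3

class StoredDocument:
    """Hypothetical document API: element text lives in a database table."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS element (path TEXT PRIMARY KEY, text TEXT)"
        )

    def get_text(self, path):
        row = self.conn.execute(
            "SELECT text FROM element WHERE path = ?", (path,)
        ).fetchone()
        return row[0] if row else None

    def set_text(self, path, text):
        # The change stays pending until commit() is called.
        self.conn.execute(
            "INSERT OR REPLACE INTO element (path, text) VALUES (?, ?)", (path, text)
        )

    def commit(self):
        self.conn.commit()

    def rollback(self):
        self.conn.rollback()

TITLE = "/library/book[1]/title"          # illustrative path, not a standard syntax
doc = StoredDocument(sqlite3.connect(":memory:"))
doc.set_text(TITLE, "Old title")
doc.commit()

doc.set_text(TITLE, "Abandoned edit")
doc.rollback()                            # the database discards the pending change

doc.set_text(TITLE, "New title")
doc.commit()                              # the database makes the change durable
print(doc.get_text(TITLE))
```

The application never serializes the whole document; it states the change and leaves consistency to the database.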
The main disadvantage of this solution is that it takes longer to set up and requires more know-how. XML database solutions may also not support all programming languages in the way that the "XML files" solution does. However, for large-scale projects, using files is generally not an option at all, making the choice obvious.
1.4.1.3 Traditional databases
However, it is definitely possible to use XML in an information system without having to use XML as the data model for the central data repository. Instead, the data store can use traditional databases and their data models, but map data back and forth between the database model and XML as needed. This has the advantage that existing systems can continue as they are today.
Imagine that the national library of some country decides one day that all libraries in the country must allow their users to search for the books they seek not only in the local libraries, but in all libraries in the country. The users should then be allowed to order any books not in the local library from other libraries and have them delivered to the local library to be picked up there. This means that the library information system in Figure 1-3 must add more applications. It must now be able to produce, at regular intervals, a report in serialized form that shows the updates to the local database since the last report. This report will be sent to the national library, which will use it to prepare a report of nationwide updates to be sent to all libraries in the country. The system must therefore also be able to receive a similar report from the national library that provides updates to the national database of books. Figure 1-4 shows the information system updated to handle this new situation.
Figure 1-4 The library system with XML reporting
This information system also uses XML, but in a less direct way than the other approaches discussed so far. However, for information systems with more traditional data, this may be a much better solution than putting all data into the XML data model, since traditional databases have much more convenient data models.
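The mapping between a traditional database and an XML update report can be sketched as below, using Python's standard `sqlite3` and `xml.etree.ElementTree` modules. The `book` table, the `<updates>` report format, the ISBN values, and the function names are all assumptions made for illustration; the text does not prescribe any particular report format.

```python
import sqlite3
import xml.etree.ElementTree as ET

def new_catalog():
    """Create an empty in-memory catalog with a hypothetical 'book' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT, changed TEXT)")
    return conn

def make_report(conn, since):
    """Serialize every row changed after 'since' (an ISO date) as XML."""
    report = ET.Element("updates", {"since": since})
    rows = conn.execute(
        "SELECT isbn, title, changed FROM book WHERE changed > ? ORDER BY isbn",
        (since,),
    )
    for isbn, title, changed in rows:
        book = ET.SubElement(report, "book", {"isbn": isbn, "changed": changed})
        ET.SubElement(book, "title").text = title
    return ET.tostring(report, encoding="unicode")

def apply_report(conn, xml_text):
    """Map a received report back into rows; return the number of books applied."""
    root = ET.fromstring(xml_text)
    with conn:                                   # one transaction per report
        for book in root.iter("book"):
            conn.execute(
                "INSERT OR REPLACE INTO book (isbn, title, changed) VALUES (?, ?, ?)",
                (book.get("isbn"), book.findtext("title"), book.get("changed")),
            )
    return len(root.findall("book"))

local = new_catalog()
with local:
    local.executemany(
        "INSERT INTO book VALUES (?, ?, ?)",
        [("111", "Old Book", "2001-01-10"), ("222", "New Book", "2001-03-05")],
    )

national = new_catalog()
report = make_report(local, "2001-02-01")        # local library sends its updates
print(report)
print(apply_report(national, report))            # national library receives them
```

Both databases keep their relational model; XML appears only on the wire between them.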
1.4.2 Bridging information systems
The discussion of information systems given so far in this chapter is based on the traditional view of a database system, where there is a clearly defined information system and the database itself at the heart of that information system. However, most organizations do not have just one information system. Most of them have lots of information systems, and these are usually isolated from one another. XML promises to solve this problem, by making it possible to build bridges between these systems. Or, to take an entirely different view of the same thing, XML does not require a central database, or even a clearly defined information system, and so it provides a completely different way of creating applications.
The XML equivalent of an information system is what is known as an XML application. An XML application consists of three things: an information model, an XML representation of the information model (often formalized in a DTD or schema definition) and all the programs that can work with data marked up according to the information model.
The result is that the traditional concept of one application or one information system does not apply to XML-based systems. With XML, the information becomes the focal point, and the software exists as a cloud of independent components and systems that interact with one another and accept or emit the XML-encoded data. How they interact with one another is not defined by XML at all, and many different arrangements are possible.
One example of this might be RSS (Rich Site Summary), which is a very simple XML application developed by Netscape for their my.netscape.com site. The idea behind this site was that it would allow Web site publishers to add simple news channels to their sites, which people could subscribe to through the my.netscape.com site. Each user would register and get a user name and password, and then subscribe to a selection of channels of interest to them.
When logging into the site later, the user would be shown the current news from each channel he or she subscribed to. Effectively, this would be a personalized news system with content delivered by outside sources. The RSS DTD was developed to enable site publishers to mark up their news channels consisting of news items, each with a title and a link to some Web page with more information. (RSS is described in more detail in 6.4, "RSS: An example application," on page 149.) This application quickly became a big hit with site owners and hundreds of RSS channels were established, which prompted others to start building more RSS client systems. Today you can also subscribe to RSS channels through my.userland.com and geekboys.org, and there are at least three dedicated RSS clients that you can use on your desktop.
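To make the structure concrete, here is a minimal sketch of a program that reads such a channel, written with Python's standard `xml.etree.ElementTree` module. The sample document follows the general RSS 0.91 layout (a channel containing items, each with a title and a link), but its content, the example.com URLs, and the `read_channel` helper are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample channel; the site name and URLs are placeholders.
RSS_DOC = """<rss version="0.91">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>News from an example site</description>
    <item>
      <title>First story</title>
      <link>http://www.example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/2</link>
    </item>
  </channel>
</rss>"""

def read_channel(xml_text):
    """Return the channel title and a list of (item title, item link) pairs."""
    channel = ET.fromstring(xml_text).find("channel")
    items = [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]
    return channel.findtext("title"), items

name, items = read_channel(RSS_DOC)
print(name)
for title, link in items:
    print(f"- {title}: {link}")
```

A reader like this needs to know nothing about how the publisher produced the document, only the agreed markup.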
Figure 1-5 RSS as an information system
Figure 1-5 shows a conceptual view of RSS as an information system. As can be seen, it incorporates the following software components:
- The publishing system of the site owner (manual or automated) that produces the site itself and the accompanying RSS document.
- The RSS subscription and publishing system of my.userland.com, developed with no knowledge at all of the site owner's publishing system, but which can still work with it, through the information provided by the RSS document. Effectively, the RSS document becomes an interface with unusually loose coupling.
- my.netscape.com has an equivalent system, developed independently of both my.userland.com and the site owner's system (www.geekboys.org is another example, and there are probably more).
- The RSS client running on the end-user's computer is yet another software component independent of the others. In a sense, the Web browser could also be described as part of the system, even though it doesn't understand RSS at all.
To summarize, XML applications do not need to be information systems in the traditional sense; they can instead be something that joins together previously separate information systems in new ways.