1.3 Information systems
An information system is a collection of information that is, in essence, a model of some aspects of the world. It is of interest to its users because it can answer questions about these aspects of the world. Before the advent of computerized information systems the only way to find out if, for example, a book was available in a library or lent out was to go and look at the shelf where the book ought to be. If it was not there, it would be assumed that someone was currently reading it. Today, however, librarians will consult the library information system to see whether the book is available or not, and only check the information from the system against external reality (the shelf) if the reader insists.
An information system need not be digital: A paper encyclopedia, for example, is an information system that can answer a large number of questions when consulted by a human. This book, however, is written strictly with digital information systems in mind. These are usually used to store information about the world external to the computer, but not always. One exception might be the registries that many computer systems maintain of installed software and configuration information for that software.
1.3.1 Anatomy of classical information systems
Any information system exists as part of a larger context in which it plays a specific role. In the case of the library, the context is the librarians who consult and update the system, and its role is to help answer questions such as "what books does the library have," "where can I find this book," "is this book available or not," and so on.
Figure 13 shows a diagrammatic outline of what the library information system might look like. In the center, there is a data store of some kind, most likely a relational database. Around it are several applications which all access the central data repository, without being aware of each other. These applications are used by three different groups of people: the librarians, the readers, and the system administrators. This is how classical information systems have generally been structured. The exact structure varies from system to system, but in broad outline these are the features that most such systems have had.
Figure 13 The anatomy of the library information system
In such systems, the basis of the entire system is the schema used to define the internal data representation in the data store. The schema defines how data is stored in the data store, and this determines how applications can access and work with the data. The schema defines the structure of the data and lays down constraints on it. For example, the schema might say that the book ID code must be unique, each row in the loan table must have valid book and reader ID codes, and so on. These rules are (usually) enforced by the data store, which means that even though there are many different applications, perhaps written by different people over a long period of time, one can be certain that none of the applications will violate these rules.
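As a minimal sketch of such enforcement, the following uses Python's built-in sqlite3 module as a stand-in for the library's data store. The table and column names are hypothetical and chosen only for illustration; a real library system would have a much richer schema. The point is that the rejected inserts are refused by the data store itself, not by the application code.

```python
import sqlite3

# Hypothetical tables and columns, chosen only for illustration.
SCHEMA = """
CREATE TABLE book   (book_id   INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE reader (reader_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE loan (
    book_id   INTEGER NOT NULL REFERENCES book(book_id),
    reader_id INTEGER NOT NULL REFERENCES reader(reader_id)
);
"""

def open_library_db():
    """Create an in-memory database with the schema above."""
    conn = sqlite3.connect(":memory:")
    # SQLite only checks foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn

def try_insert(conn, sql, params):
    """Run one INSERT; return True if the data store accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

conn = open_library_db()
ok_book = try_insert(conn, "INSERT INTO book VALUES (?, ?)", (1, "Moby-Dick"))
dup_book = try_insert(conn, "INSERT INTO book VALUES (?, ?)", (1, "Another title"))
ok_reader = try_insert(conn, "INSERT INTO reader VALUES (?, ?)", (7, "A. Reader"))
ok_loan = try_insert(conn, "INSERT INTO loan VALUES (?, ?)", (1, 7))
bad_loan = try_insert(conn, "INSERT INTO loan VALUES (?, ?)", (99, 7))
```

Here the second book insert fails because the book ID is already taken, and the last loan insert fails because there is no book with ID 99. Any other application talking to the same data store would be held to the same rules.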
Another role played by the schema is that of documenting the structure of the data that the system manages as well as many of the assumptions made in the system design. Together with prose, the schema is very valuable as documentation, since it is concise, clear, and unambiguous.
It should also be noted that the schema plays a very important role in that it effectively defines the limits for what kinds of functionality can be supported by the applications in the system. For example, if the library information system does not record the Dewey classification code of each book, searching for books by their Dewey codes cannot be supported at all.
The information stored in a database is in a half-way state between liveness and suspendedness, not really being entirely in memory or entirely serialized. It should probably be considered to be live, since the application does not need to expend much effort to access the data and the data certainly are not serialized in the database. The data export application in Figure 13 would serialize information from the system into some kind of transport notation, whether for sending to other installations elsewhere or for backup purposes. Other than that, the library system does not really do any serialization or deserialization. It holds all its needed data internally and has little need for communication with the outside world, except through user interaction.
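What a data export application does can be sketched roughly as follows, assuming a hypothetical book table and an ad hoc XML transport notation invented for this example. The sketch uses only Python's standard sqlite3 and xml.etree.ElementTree modules and is not meant to describe any particular library system.

```python
import sqlite3
import xml.etree.ElementTree as ET

def export_books(conn):
    """Serialize every row of a (hypothetical) book table as an XML string."""
    root = ET.Element("books")
    rows = conn.execute("SELECT book_id, title FROM book ORDER BY book_id")
    for book_id, title in rows:
        book = ET.SubElement(root, "book", id=str(book_id))
        ET.SubElement(book, "title").text = title
    return ET.tostring(root, encoding="unicode")

def import_books(xml_text):
    """Deserialize an export back into (id, title) tuples."""
    return [(int(b.get("id")), b.findtext("title"))
            for b in ET.fromstring(xml_text).iter("book")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO book VALUES (?, ?)",
                 [(1, "Moby-Dick"), (2, "Hunger")])
exported = export_books(conn)
```

The string in `exported` can be written to a backup file or sent to another installation, where something like `import_books` would turn it back into live data.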
1.3.2 Formality in information systems
Digital information systems can usefully be divided into two categories: formal and informal systems. In formal systems the information follows strict rules, while informal systems are free-form. This division is not absolute, since systems can have varying degrees of formality, but a typical example of an informal information system might be a collection of word processor documents containing a list of the CDs available in a library in the form of prose.
Even though this collection of documents could be consulted by a human to find, for example, the number of songs in the CD collection, a computer would not be able to do the same, since it cannot read text and understand what it says. To enable a computer to answer this question, one would have to develop a formal information system to store the information in such a way that the computer, still without knowing what a song or a CD really is, could perform some simple operations that would result in the number of songs being counted.
Doing this, however, means formalizing the system and making it more rigid, which may be hard if the information in it has a very complex structure, or if that structure is poorly understood. Furthermore, a formal system will be harder to extend, since formal systems give much less flexibility in terms of how information is expressed. The benefit is much greater convenience in use through automation. For example, although a human might in theory count the songs on all the CDs in a library, that would require a large amount of manual work, while a properly designed digital information system could answer the question within seconds.
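To make the contrast concrete, here is a small sketch of what the formalized CD list might look like to a program. The record layout and field names are hypothetical. Note that the function counts track entries purely by structure, with no notion of what a song or a CD is.

```python
# Hypothetical catalogue records; the field names are illustrative only.
cds = [
    {"title": "Album A", "tracks": ["Song 1", "Song 2", "Song 3"]},
    {"title": "Album B", "tracks": ["Song 4", "Song 5"]},
]

def count_tracks(catalogue):
    """Count track entries without any understanding of their meaning."""
    return sum(len(cd["tracks"]) for cd in catalogue)

total = count_tracks(cds)
```

The price of this convenience is visible in the same sketch: every CD must be described as a title plus a list of tracks, and anything that does not fit that shape has nowhere to go.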
Quite often, an organization will start out with a highly formal system, such as one for books in a library. After the system has been in use for a while, the library starts stocking CDs in addition to books, but since CDs do not fit in the information system (the structure being too specifically directed towards books) the list of CDs is kept in simple text documents instead.
Eventually, this solution is bound to become insufficient as the number of CDs in the library grows. To solve this, the original information system is extended with support for CDs, and the information in the text documents is migrated into the larger system. From this point on, both CDs and books will be supported. Most large real-life information systems will at any point in their lifetime consist of a highly formal core with several smaller informal systems clustered around it. These informal systems will typically contain less data and often also be only temporary in nature. Some of them, however, will grow and eventually demand to be made more formal and need applications of their own.
One of the strengths of XML is that it supports this pattern of gradual formalization very well, since it can represent both relatively informal and quite formal data. XML information systems also tend to be easier to set up initially, and easier to change later, than their more formal competitors. XML data is generally less formal and less tightly controlled than data in ordinary databases. With XML, checking validity is a separate operation, performed when necessary, and not something enforced by the data storage mechanism itself (except when an XML database with such functionality is used, which is relatively rare).
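The separation between storing XML and checking it can be illustrated with a small sketch. The hand-written `check_cd_list` function below is only a stand-in for real validation against a DTD or schema, and the element names are hypothetical. What it shows is that parsing succeeds for any well-formed document, and that the rules are applied only when someone chooses to run the check.

```python
import xml.etree.ElementTree as ET

def check_cd_list(root):
    """Return a list of problems; an empty list means the document passed."""
    problems = []
    for i, cd in enumerate(root.findall("cd"), start=1):
        if not (cd.findtext("title") or "").strip():
            problems.append(f"cd {i}: missing title")
        if not cd.findall("track"):
            problems.append(f"cd {i}: no tracks")
    return problems

text = (
    "<cds>"
    "<cd><title>Album A</title><track>Song 1</track></cd>"
    "<cd><note>tracks to be added</note></cd>"
    "</cds>"
)

doc = ET.fromstring(text)      # succeeds: the document is well-formed
problems = check_cd_list(doc)  # checking is a separate, optional step
```

The second, half-finished entry can be stored and passed around as it is; a relational data store with the corresponding constraints would have refused it at insertion time.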
1.3.3 Ontologies
To be able to formalize the system, one really should design a schema that defines the structure, but before a schema can be made there are two steps that need to be taken. Often, these are taken without being explicitly thought through, and this may even work well, but it is still worth knowing about the steps.
The first step is to decide which parts of the world fall within the scope of the information system; these are often called the Universe of Discourse (UoD) for that particular system. In the example above, the UoD would be the CD collection of some library, and implicitly only the music CDs (since we mentioned songs), not the CD-ROMs with software and data. The second step towards a schema is to analyze the UoD to find out what it consists of and which parts of it are considered interesting.
This analysis would result in what is called an ontology, which means a theory of reality. Such a theory of reality might state that our particular UoD consists of CDs, artists, and songs. This is a pretty naive theory, though, as it omits many interesting aspects of the UoD. For example, artists can be individual people, such as Mariss Jansons and Peter Gabriel, but also groups of people, such as the Oslo Philharmonic Orchestra and Genesis. Some artists have released music both individually and as part of a group of people (for example, Peter Gabriel was a member of Genesis until 1975, but released solo albums after that).
Another, even subtler, problem arises when we try to count the songs in the CD collection, because we haven't decided what a song really is. For example, Peter Gabriel has released three different CDs that all contain a song titled Biko. Does this count as one song, or as three? The version on the album usually known as 3 is the original studio version, the version on Plays Live is a live recording, and the version on Shaking the Tree is indistinguishable from the original studio version on 3.
The complexity does not stop there, for these CDs are issued in slightly different versions in different countries, and records that were originally released as LPs are often re-released once on CD with poor quality and later remastered to much better quality. This produces CDs with identical titles and song listings, identical (or near-identical) covers, but with subtly different sounds.
Clearly, to be able to make a structured information system for something as messy as this, we need a theory of reality, an ontology that can tell us what is what. One such ontology already exists, and is known as IFLA FRBR, or Functional Requirements for Bibliographic Records, defined by the International Federation of Library Associations and Institutions. The specification can be found at http://www.ifla.org/VII/s13/frbr/frbr.pdf. This ontology deals with what it calls creations (not just music) and defines three main categories of creations:
manifestations
These are tangible creations that are either physical objects composed of atoms and molecules or digital objects consisting of bits and bytes. A CD and a track on a CD would both be manifestations, as would notes printed on paper.
performances
These are spatio-temporal creations, that is, creations that have taken place as events in space and time. A concert would be a typical example of a performance. If a performance is recorded somehow, that recording becomes a manifestation of the performance.
works
Works are the least tangible category of creations, being abstract creations. For example, if you think of a new melody, that becomes an abstract creation, and its existence will not be revealed until you either make a manifestation of it (by writing down the notes) or a performance of it (by humming it or singing it out loud).
With this ontology in hand, we can suddenly make sense of the confusion we suffered earlier. The question "How many songs are there?" was ill-posed, in the sense that we had not properly defined the term "song." Instead, we have three new terms, and occurrences of these we can count with confidence. So, Biko is a work, which has been performed in the studio and also live in concert. The three occurrences of the work are three different manifestations of two different performances of one work.
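The Biko example can be written down directly in terms of the three categories. The following is a minimal sketch using Python dataclasses; the class and field names are our own illustration and are not taken from the FRBR specification.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Work:
    title: str

@dataclass(frozen=True)
class Performance:
    work: Work
    setting: str

@dataclass(frozen=True)
class Manifestation:
    performance: Performance
    album: str

biko = Work("Biko")
studio = Performance(biko, "studio")
live = Performance(biko, "live concert")

# One track per CD; two of them carry the same studio performance.
tracks = [
    Manifestation(studio, "3"),
    Manifestation(live, "Plays Live"),
    Manifestation(studio, "Shaking the Tree"),
]

counts = {
    "manifestations": len(tracks),
    "performances": len({t.performance for t in tracks}),
    "works": len({t.performance.work for t in tracks}),
}
```

Counting now gives three manifestations, two performances, and one work, and each of those numbers is a well-defined answer to a well-defined question.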
1.3.4 Information models
With the ontology in place, we can start to make an information model for our UoD. The information model is a detailed conceptual model of all the information in the system, including all types of items with their fields (or properties) and the relationships between them. For our example we could start by defining the item types CD, track, person, artist, and work (choosing to disregard performances) and then continue by defining the attributes of each and their relationships.
An information model differs from a schema in that the schema is defined in terms of a data model, while the information model is independent of any particular data model. In fact, part of the reason for making an information model is that the model is not plagued by the weaknesses of some data models, and this means that we can model the data more or less directly. The information model is generally created either informally, using some undefined data model, or formally, using a modeling language. Among the possibilities are the Entity-Relationship (ER) language, Object Role Modeling (ORM), and Unified Modeling Language (UML). Some people also use the EXPRESS schema language, which is powerful enough to define the information model even when EXPRESS itself will not be used in the system to be developed.
Once all the item types, their attributes, and relationships were worked out and clearly defined, we would have an information model for the information system. This would not be something that could be used directly to generate programs or to configure software to manage the system for us, but would be a conceptual specification that could serve as documentation for the system. Typically, the developers of the software components in the system would use the information model as guidance when developing the components, and it would also be used to set up any central data repositories such as a database.
To make a schema for the system, the developers would need to select a schema language and express the information model in terms of the data model used by that schema language. This step often involves more than a simple reformulation of the information model, since changes may prove necessary for various kinds of performance reasons. Generally, the information model is designed to be easy to understand, while the schema must be designed to be efficient.
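As a rough illustration of this step, suppose a simplified information model says that a CD has tracks, that each track records one work, and that a track is credited to one or more artists. Expressed in the relational data model, using SQLite through Python's sqlite3 module, this might look as follows. All table and column names are hypothetical, and the model is deliberately smaller than the one outlined above.

```python
import sqlite3

# Hypothetical relational schema for a simplified CD information model.
SCHEMA = """
CREATE TABLE work   (work_id   INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE cd     (cd_id     INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE track (
    track_id INTEGER PRIMARY KEY,
    cd_id    INTEGER NOT NULL REFERENCES cd(cd_id),
    work_id  INTEGER NOT NULL REFERENCES work(work_id)
);
-- The relational data model has no direct many-to-many relationship,
-- so "credited to" becomes a table of its own.
CREATE TABLE track_artist (
    track_id  INTEGER NOT NULL REFERENCES track(track_id),
    artist_id INTEGER NOT NULL REFERENCES artist(artist_id),
    PRIMARY KEY (track_id, artist_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# List the tables the schema actually produced.
tables = [name for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

The information model has four item types, but the schema ends up with five tables: the extra `track_artist` table exists only because of how the relational data model represents relationships, which is the kind of reformulation this step involves.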
1.3.5 Summary
To briefly reiterate the terms introduced in this section, an information system is a model of a subset of the external world known as the Universe of Discourse. The basis for the model is an ontology, a theory of reality, based on which a conceptual information model describing the detailed structure of the system is created. The information model is then turned into a schema for the data model used by the system (or possibly more than one schema, if the system uses more than one data model).