3.4 Text markup
The "HT" in HTML stands for HyperText, and the early historical Web was very much textual. Despite all the graphic and multimedia advances of recent years, this textual foundation has not eroded. The advance of XML has, if anything, only strengthened it.
Any text markup language must provide a sufficient inventory of markup constructs for in-flow text fragments that for some reason must be differentiated from their context. Examples of such fragments include emphasized words or phrases, names or identifiers, quotes, and foreign language citations.
Block and inline elements. HTML (as well as other presentation-oriented
vocabularies, for instance XSL-FO) differentiates between block-level
and inline-level objects. This distinction has to do mostly with
visual formatting, as block-level elements are supposed to be stacked vertically,
while inline elements are part of the horizontal flow of text.
Therefore, it is not really relevant for your semantic XML markup, which
must reflect content structure, not formatting. Still, since HTML is your
primary target format, the block/inline distinction may sometimes have repercussions
for your source definition.
For example, it may be difficult to handle situations where a source element that normally transforms into an inline-level target element has to apply to a larger fragment of a document (3.4.3). From the XSLT viewpoint (4.5.1), block-level elements are more often generated by pull-style trunk templates, while inline-level elements are the exclusive domain of push-style branch templates.
Existing vocabularies. DocBook is an established standard dating from 1991 that is used mostly for technical books and documentation. It may well be the most widely used XML vocabulary after XHTML; when somebody tells you, "My documents are in XML," chances are it's actually DocBook. Software support for this vocabulary is also quite good.
DocBook is vast but not too deep, so it is simple to learn despite its large number of element types (epigraphs, bibliographies, programming code, glossaries, and so on). If you don't understand what a particular element type is supposed to do, you probably don't need it (yet). For those constructs you do need, however, DocBook may be a rich source of text markup and structuring wisdom.
TEI (Text Encoding Initiative) is an older and bigger beast, developed for markup of all kinds of scientific and humanities texts. Compared to DocBook, it is focused more on low-level text markup than on high-level book structures. The TEI DTD offers many modules that cover everything from verse to graph theory, so it is highly recommended if you need to mark up specialized texts. The TEI Guidelines is a very comprehensive and detailed guidebook explaining the use of the TEI DTD as well as many finer points of marking up complex text constructs.
3.4.1 Mark up the meaning
Your source XML must be semantic; that is, it must reflect the meaning of text-level constructs, not their presentation. The em and i element types, both present in HTML, provide a canonical illustration of this principle. While an i element dictates using an italic face in visual media, an em only designates an emphasis, which is a semantic concept rendered differently in different media. For example, a fragment of text inside em can be set in italic in a graphic browser, but it can also be highlighted in a text-mode browser or read aloud emphatically by a speech browser.
Modern HTML discourages i and other presentation-oriented element types; instead, you are supposed to use appropriate semantic element types such as em, possibly in combination with CSS. In your XML source, however, merely discouraging such markup is not an option: you have to make sure that your schema makes presentation-oriented markup impossible. Formatting hints (3.6.2) should be used in your XML only when absolutely unavoidable.
3.4.2 Rich markup
The same visible formatting may result from different source markup. For example, you may use the same italic font face for both emphasis and citations, but they must be marked up differently in your source. What only a human reader can distinguish in the formatted result should, ideally, be automatically distinguishable in the source.
In general, semantic markup in the source should be richer and more detailed than the resulting HTML markup after transformation. For example, it is often a good idea to use special element types to mark up all dates, person names, or company names in your source, even though in the resulting web pages they are not formatted in any special way.
Why mark up what you don't need right here and now? Because your XML source
is more than just an undeveloped (as in "undeveloped film") version of the
web site. Rather, it is the start of a project that will keep growing and
changing, sprouting new connections and renditions over time. For example,
you may want to reuse your web site material in PDF brochures, interactive
CDs, archival and search applications, and more.
This means that your XML source must be able to serve as the semantic foundation not only for your current site but also for everything it can potentially become. You may not need any extra markup right now, but it may come in very handy when you extend your site or reuse the source documents for anything beyond the web site pages.
Imagine that one day you need to convert all dates on your site from one format to another (e.g., from MM/DD/YY to DD/MM/YY). Dealing with dates scattered in the text is so much easier if all of them are marked up consistently - for example,
... which happened on <date><month>09</month><day>04</day><year>2003</year></date>.
instead of simply
... which happened on 09/04/2003.
With rich markup, you can change dates' rendition (e.g., reorder date components or use a different separator character) without touching the source at all, simply by modifying the stylesheet.
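As a rough illustration of this point, here is a minimal sketch in Python (standing in for the XSLT stylesheet, which is what you would normally modify). It assumes the date, month, day, and year element types from the example above; the function name render_dates and its parameters are hypothetical. The same source text is rendered in two date formats without being edited.

```python
import xml.etree.ElementTree as ET

# Source fragment using the rich date markup from the text.
# The enclosing <p> is an assumption made so the fragment is well-formed.
SOURCE = (
    "<p>... which happened on "
    "<date><month>09</month><day>04</day><year>2003</year></date>.</p>"
)

def render_dates(xml_text, order=("month", "day", "year"), sep="/"):
    """Return the plain text of xml_text with each date rendered in the given order."""
    root = ET.fromstring(xml_text)
    # Materialize the list first, because the loop modifies the tree.
    for date in list(root.iter("date")):
        parts = [date.findtext(name, default="") for name in order]
        tail = date.tail      # clear() also resets the tail, so keep it
        date.clear()          # drop the month/day/year children
        date.text = sep.join(parts)
        date.tail = tail
    return "".join(root.itertext())

print(render_dates(SOURCE))
print(render_dates(SOURCE, order=("day", "month", "year"), sep="."))
```

Only the arguments change between the two calls; the marked-up source stays the same, which is the role the stylesheet plays in a real transformation.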
On another occasion, you may decide to paint all company names (or only your own company's name) green on your web pages. Or, you may find it a good idea to automatically compile an index of all persons' names mentioned on your site. All of these tasks are only possible if your source XML has these elements consistently and unambiguously marked up.
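The name index mentioned above could be compiled along the following lines. This is only a sketch in Python rather than XSLT, and it rests on assumptions: the person and company element type names, the page wrapper, and the sample names are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical source page; element type names and content are illustrative.
SOURCE = """<page>
  <p><person>Jane Doe</person> met <person>John Smith</person> in the lobby.</p>
  <p><company>Example Corp</company> later hired <person>Jane Doe</person>.</p>
</page>"""

def person_index(xml_text):
    """Return a sorted list of the distinct person names marked up in xml_text."""
    root = ET.fromstring(xml_text)
    names = {"".join(p.itertext()).strip() for p in root.iter("person")}
    return sorted(names)

print(person_index(SOURCE))
```

Without the person markup, the same task would require guessing which words in the running text are names, which is exactly the kind of ambiguity rich markup removes.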
The need for rich text markup obviously depends on the quality, value, and planned longevity of your material. You don't need rich markup for short-lived stuff, but if you want your material to remain useful in the long term, you should always try to think in terms of "what markup is perfect for this content" rather than "what markup is sufficient for the task at hand." Examples of long-lived or otherwise valuable content include standards, specifications, historical texts, etc.
Existing vocabularies. As an example (and a good source of ideas), consider NITF (News Industry Text Format), which is a standard vocabulary for rich markup of news stories. Only a necessary minimum of NITF markup may be used in a story that goes directly to press; however, for exchange, syndication, or archival use, a complete enriched NITF markup is required. A properly prepared NITF news story uses rich markup to answer questions such as who the story is about, when and where the described event occurred, and even why it is considered newsworthy by the story author.
3.4.3 Transcending levels
The text elements we've discussed in this section would be termed inline in HTML, meaning they are only allowed within block elements such as paragraphs. However, this limitation does not always make sense. For example, a rich markup element such as emphasis may need to be applied to more than one complete paragraph.
Usually, this is an indication that these paragraphs constitute some logical entity, such as a quotation, which (rather than the emphasis itself) you need to mark up. However, there may be situations where no such element exists, but inline text markup still has to spread across one or more block elements. What are we to do in such cases?
Inserting a separate inline markup element within each paragraph is the least elegant solution:
<p><em>This is the first paragraph using emphasis throughout.</em></p>
<p><em>And this is the second emphasized paragraph.</em></p>
This leads to unnecessary duplication of markup, poor maintainability, and just plain ugliness. It is, however, the only option if your emphasis spans, say, one and a half paragraphs.
The simplest approach is to just do away with the inline/block distinction and allow any text markup to be applied at any level of the hierarchy, both below and above the paragraph level. This will allow you to enclose all affected paragraphs into a common parent element specifying emphasis:
<em>
  <p>This is the first paragraph using emphasis throughout.</p>
  <p>And this is the second emphasized paragraph. Note that we can use <em>nested emphasis</em>.</p>
</em>
This might make sense, especially in contexts where you want to allow both paragraphs and short non-paragraph text fragments (3.3). The problem with this approach is that it blurs your hierarchy of element types, thereby making your documents harder to maintain and more prone to errors.
It might be argued, on the other hand, that the emphasis spanning one or more paragraphs is semantically different from the emphasis that spans one or more words. Therefore, they could use different element types:
<emphasis>
  <p>This is the first paragraph using emphasis throughout.</p>
  <p>And this is the second emphasized paragraph. Note that we can use <em>nested emphasis</em>, but this time it is a different element type for the inline level.</p>
</emphasis>
If the paragraph-level emphasis is semantically connected with the paragraph element, you can instead add an attribute to those paragraphs that fall within its scope:
<p type="emphasis">This is the first paragraph using emphasis throughout.</p>
<p type="emphasis">And this is the second emphasized paragraph. Again, <em>nested emphasis</em> is possible.</p>
Among these options, there is perhaps no single winner suitable for all situations. Your choice will depend on the semantics of the element in question, the frequency of its use at inline and block levels, and the possible connections between its semantics and that of the standard block-level element (paragraph).
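Whichever option you choose, the block-level variants can usually be converted into one another mechanically, so the choice is not irreversible. The following Python sketch (again a stand-in for XSLT) turns the emphasis wrapper form into the type="emphasis" attribute form. The section wrapper and the function name unwrap_emphasis are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Wrapper form: a block-level <emphasis> enclosing whole paragraphs.
SOURCE = (
    "<section>"
    "<emphasis>"
    "<p>This is the first paragraph using emphasis throughout.</p>"
    "<p>And this is the second emphasized paragraph.</p>"
    "</emphasis>"
    "<p>A regular paragraph.</p>"
    "</section>"
)

def unwrap_emphasis(parent):
    """Replace each emphasis child of parent by its children, marked type="emphasis"."""
    new_children = []
    for child in list(parent):
        if child.tag == "emphasis":
            for p in child:
                p.set("type", "emphasis")
                new_children.append(p)
        else:
            unwrap_emphasis(child)  # look for wrappers deeper in the tree
            new_children.append(child)
    parent[:] = new_children  # slice assignment replaces all children

root = ET.fromstring(SOURCE)
unwrap_emphasis(root)
print(ET.tostring(root, encoding="unicode"))
```

The sketch ignores whitespace between elements and does not handle an emphasis wrapper nested directly inside another one; a real conversion would need to decide what those cases mean.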
3.4.4 Nested markup
Another issue with text markup is whether nesting elements of one type is to be allowed. Presentation-oriented markup never nests, for instance, i within i, but for semantic markup, a similar structure may be meaningful. Thus, emphasis within emphasis or a quote within a quote are both perfectly valid semantically, even though in an HTML rendition, nesting of the corresponding formatting elements may have no visible effect.
Therefore, to properly transform nested semantic markup, you must use different formatting depending on the nesting level of the semantic element. For example, if you use italic face for emphasis, nested emphasis can be rendered either as regular face ("toggle" approach, where you switch between regular and italic faces for each new nesting level) or as bold italic face ("additive" approach, in which the italic rendition of the parent is augmented by the bold formatting of the child).
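The two approaches can be sketched as follows. This is a Python illustration rather than an XSLT template, and the function names, the mode parameter, and the choice of span elements with inline CSS are all assumptions of the example, not a prescribed output format.

```python
import xml.etree.ElementTree as ET
from html import escape

def style_for(level, mode):
    """Pick a CSS declaration for an em at the given nesting level (1 = outermost)."""
    if mode == "toggle":
        # Odd levels are italic, even levels switch back to regular face.
        return "font-style: italic" if level % 2 == 1 else "font-style: normal"
    if mode == "additive":
        # Level 1 is italic; deeper levels add bold to the inherited italic.
        return "font-style: italic" if level == 1 else "font-weight: bold"
    raise ValueError(f"unknown mode: {mode}")

def render(elem, level=0, mode="toggle"):
    """Render the content of elem as HTML text, styling em by nesting level.

    Elements other than em contribute only their content in this sketch.
    """
    out = [escape(elem.text or "", quote=False)]
    for child in elem:
        if child.tag == "em":
            inner = render(child, level + 1, mode)
            out.append(f'<span style="{style_for(level + 1, mode)}">{inner}</span>')
        else:
            out.append(render(child, level, mode))
        out.append(escape(child.tail or "", quote=False))
    return "".join(out)

SOURCE = "<p>An <em>important <em>nested</em> point</em>.</p>"
root = ET.fromstring(SOURCE)
print(render(root, mode="toggle"))
print(render(root, mode="additive"))
```

In both modes the source markup is identical; only the mapping from nesting level to formatting differs, so switching approaches later does not require touching the documents.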