Multiple Representations
Having discussed the importance of the principle of unification, we must also consider the opposite property: the fact that Unicode provides alternate representations for many characters. As we saw earlier, Unicode's designers placed a high premium on respect for existing practice and interoperability with existing character encoding standards. In many cases, they sacrificed some measure of architectural purity in pursuit of the greater good (i.e., people actually using Unicode). As a result, Unicode includes code point assignments for a lot of characters that were included solely or primarily to allow for round-trip compatibility with some legacy standard, a broad category of characters known more or less informally as compatibility characters. Exactly which characters are compatibility characters is somewhat a matter of opinion, and there isn't necessarily anything special about the compatibility characters that flags them as such. An important subset of compatibility characters is called out as special because these characters have alternate, preferred representations in Unicode. Because the preferred representations usually consist of more than one Unicode code point, these characters are said to decompose into multiple code points.
There are two broad categories of decomposing characters: those with canonical decompositions (these characters are often referred to as "precomposed characters" or "canonical composites") and those with compatibility decompositions (the term "compatibility characters" is frequently used to refer specifically to these characters; a more specific term, "compatibility composite," is better). A canonical composite can be replaced with its canonical decomposition with no loss of data: the two representations are strictly equivalent, and the canonical decomposition is the character's preferred representation.8
Most canonical composites are combinations of a "base character" and one or more diacritical marks. Earlier, we talked about the character positioning rules and how the rendering engine needs to be smart enough so that when it sees, for example, an a followed by an umlaut, it draws the umlaut on top of the a: ä. Much of the time, normal users of these characters don't see them as the combination of a base letter and an accent mark. A German speaker sees ä simply as "the letter ä" and not as "the letter a with an umlaut on top." A vast number of letter-mark combinations are consequently encoded using single character codes in the various source encodings, and these are very often more convenient to work with than the combinations of characters would be. The various European character encoding standards follow this pattern (for example, assigning character codes to letter-accent combinations such as é, ä, å, û, and so on), and Unicode follows suit.
Because a canonical composite can be mapped to its canonical decomposition without losing data, the original character and its decomposition are freely interchangeable. The Unicode standard enshrines this principle in law: On systems that support both the canonical composites and the combining characters that are included in their decompositions, the two different representations of the same character (composed and decomposed) are required to be treated as identical. That is, there is no difference between ä when represented by two code points and ä when represented with a single code point. In both cases, it's still the letter ä.
Most Unicode implementations therefore have to be smart enough to treat the two representations as equivalent. One way to do this is to normalize a body of text so that it consistently uses one of the representations. The Unicode standard actually defines four different normalized forms for Unicode text.
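As a concrete illustration of this equivalence, here's a minimal sketch using Java's standard java.text.Normalizer class (any normalization library would do; the class name here is just illustrative). It compares ä encoded as the single code point U+00E4 against the sequence U+0061 U+0308, an a followed by a combining umlaut. A raw comparison of the code points says the two strings differ; comparing their normalized forms says they are the same character:

    import java.text.Normalizer;

    public class CanonicalEquivalence {
        public static void main(String[] args) {
            String composed   = "\u00E4";    // ä as the single precomposed code point U+00E4
            String decomposed = "a\u0308";   // a followed by U+0308 COMBINING DIAERESIS (the umlaut)

            // A naive code-point-by-code-point comparison sees two different strings...
            System.out.println(composed.equals(decomposed));    // false

            // ...but once both strings are normalized to the same form (NFC here),
            // the two representations compare as identical.
            String nfc1 = Normalizer.normalize(composed, Normalizer.Form.NFC);
            String nfc2 = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(nfc1.equals(nfc2));              // true
        }
    }

Normalizing both strings to NFD (the fully decomposed form) instead of NFC would work equally well; the point is simply that both strings are put into the same form before they're compared.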
All of the canonical decompositions involve one or more combining marks, a special class of Unicode code points representing marks that combine graphically in some way with the character that precedes them. If a Unicode-compatible system sees the letter a followed by a combining umlaut, it draws the umlaut on top of the a. This approach can be a little more inconvenient than just using a single code point to represent the a-umlaut combination, but it does give you an easy way to represent a letter with more than one mark attached to it, such as you find in Vietnamese or some other languages: Just follow the base character with multiple combining marks.
Of course, this strategy means you can get into trouble with equivalence testing even without having composite characters. There are plenty of cases where the same character can be represented multiple ways by putting the various combining marks in different orders. Sometimes, the difference in ordering can be significant (if two combining marks attach to the base character in the same place, the one that comes first in the backing store is drawn closest to the character and the others are moved out of the way). In many other cases, the ordering isn't significant: you get the same visual result whatever order the combining marks come in. The different forms are then all legal and required, once again, to be treated as identical. The Unicode standard provides for a canonical ordering of combining marks to aid in testing such sequences for equivalence.
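Here's a small sketch of how that plays out in practice, again using java.text.Normalizer. The Vietnamese letter ệ is just a convenient example of a base letter carrying two combining marks that attach in different places (a circumflex above and a dot below), so the order of the marks isn't visually significant:

    import java.text.Normalizer;

    public class CombiningMarkOrder {
        public static void main(String[] args) {
            // The same letter with its two combining marks stored in either order:
            String circumflexFirst = "e\u0302\u0323";   // e + combining circumflex + combining dot below
            String dotBelowFirst   = "e\u0323\u0302";   // e + combining dot below + combining circumflex

            // Normalization puts the marks into the canonical order, and the two
            // sequences then compare as identical.
            String nfd1 = Normalizer.normalize(circumflexFirst, Normalizer.Form.NFD);
            String nfd2 = Normalizer.normalize(dotBelowFirst, Normalizer.Form.NFD);
            System.out.println(nfd1.equals(nfd2));   // true

            // NFC goes a step further and recombines the whole sequence into the
            // single precomposed code point U+1EC7 (ệ).
            System.out.println(Normalizer.normalize(dotBelowFirst, Normalizer.Form.NFC)
                    .equals("\u1EC7"));              // true
        }
    }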
The other class of decomposing characters is compatibility composites, characters with compatibility decompositions.9 A compatibility composite can't be mapped to its compatibility decomposition without losing data. For example, sometimes alternate glyphs for the same character are given their own character codes. In these cases, a preferred Unicode code point value will represent the character, independent of glyph, and other code point values will represent the different glyphs. The latter are called presentation forms. The presentation forms have mappings back to the regular character they represent, but they're not simply interchangeable; the presentation forms refer to specific glyphs, while the preferred character maps to whatever glyph is appropriate for the context. In this way, the presentation forms carry more information than the preferred characters do. The most notable set of presentation forms are the Arabic presentation forms, where each standard glyph for each Arabic letter, plus a wide selection of ligatures, has its own Unicode character code. Although rendering engines often use presentation forms as an implementation detail, normal users of Unicode are discouraged from using them and are urged to use the nondecomposing characters instead. The same goes for the smaller set of presentation forms for other languages.
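The difference between the two kinds of decomposition shows up directly in the normalized forms. In this sketch (the particular presentation forms chosen are just illustrative), the canonical normalization form NFC leaves presentation forms alone, while the compatibility form NFKC replaces them with the preferred characters, discarding the information that a specific glyph was intended:

    import java.text.Normalizer;

    public class PresentationForms {
        public static void main(String[] args) {
            String fiLigature   = "\uFB01";   // U+FB01 LATIN SMALL LIGATURE FI (a presentation form)
            String isolatedAlef = "\uFE8D";   // U+FE8D ARABIC LETTER ALEF ISOLATED FORM

            // Canonical normalization leaves presentation forms untouched...
            System.out.println(Normalizer.normalize(fiLigature, Normalizer.Form.NFC));     // still U+FB01

            // ...but compatibility normalization maps them to the preferred characters,
            // losing the glyph-selection information the presentation forms carried.
            System.out.println(Normalizer.normalize(fiLigature, Normalizer.Form.NFKC));    // "fi"
            System.out.println(Normalizer.normalize(isolatedAlef, Normalizer.Form.NFKC));  // U+0627 ARABIC LETTER ALEF
        }
    }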
Another interesting class of compatibility composites represents stylistic variants of particular characters. These characters are similar to presentation forms, but instead of representing particular glyphs that are contextually selected, they represent particular glyphs that would normally be specified through the use of additional styling information (remember, Unicode represents only plain text, not styled text). Examples include superscripted or subscripted numerals, or letters with special styles applied to them. For example, the Planck constant is conventionally written using an italicized letter h. Unicode includes a compatibility character code for the symbol for the Planck constant, but you could also just use a regular h in conjunction with some non-Unicode method of specifying that it's italicized. Characters with adornments such as surrounding circles fall into this category, as do the abbreviations sometimes used in Japanese typesetting that consist of several characters arranged in a square.
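A similar sketch shows the data loss for stylistic variants (again, the specific characters are just examples): compatibility normalization reduces U+210E PLANCK CONSTANT and U+00B2 SUPERSCRIPT TWO to a plain h and a plain 2, throwing away the styling that canonical normalization would have preserved:

    import java.text.Normalizer;

    public class StylisticVariants {
        public static void main(String[] args) {
            String planck      = "\u210E";   // U+210E PLANCK CONSTANT (conventionally an italic h)
            String superscript = "\u00B2";   // U+00B2 SUPERSCRIPT TWO

            // Canonical normalization keeps the stylistic variants intact...
            System.out.println(Normalizer.normalize(planck, Normalizer.Form.NFC));         // still U+210E

            // ...while compatibility normalization folds them down to the plain
            // characters, so the "italic" and "superscript" styling is lost.
            System.out.println(Normalizer.normalize(planck, Normalizer.Form.NFKC));        // "h"
            System.out.println(Normalizer.normalize(superscript, Normalizer.Form.NFKC));   // "2"
        }
    }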
For compatibility composites, the Unicode standard specifies not only the characters to which they decompose, but also information intended to explain what nontext information is needed to express exactly the same thing.
Canonical and compatibility decompositions, combining characters, normalized forms, canonical accent ordering, and related topics are all dealt with in excruciating detail in Chapter 4.