Unicode and Encoding
Hopefully, from what I presented in the preceding section, it is more than apparent just how nasty encoding can be, especially when you’re dealing with code pages. Thankfully, the pain was felt far and wide, and the result was Unicode. Two of Unicode’s mandates were to eliminate code page collisions and give each character its own individual code point value. The name “Unicode” comes from the desire to have a “universal” character set or, precisely, “universal code points.”
Unicode Planes
Unicode is broken down into 17 planes, a plane being a contiguous group of 65,536 (2^16) code points. Plane 0 is known as the Basic Multilingual Plane (BMP) and is where almost all of your day-to-day characters reside. The notable exception to this is the Emoji characters. Planes 1 through 16 are largely empty of characters and are termed supplementary planes. Table 2.3 lists the available Unicode Planes and the types of characters they hold.
Table 2.3 Unicode Planes
| Plane | Character Types It Contains |
| --- | --- |
| Plane 0 | Basic Multilingual Plane (BMP): Contains characters for most modern languages plus a large number of special characters. Code points in this plane are also used to encode CJK characters. |
| Plane 1 | Supplementary Multilingual Plane (SMP): Contains Emoji characters and other pictographs. Also holds historical scripts such as hieroglyphics. |
| Plane 2 | Supplementary Ideographic Plane (SIP): Contains CJK characters. |
| Planes 3–13 | Mostly unassigned. Plane 3 is provisionally named the Tertiary Ideographic Plane (TIP). |
| Plane 14 | Supplementary Special-purpose Plane (SSP): Contains nongraphical characters, such as language tag characters, as well as selectors for alternative glyphs. |
| Planes 15–16 | Supplementary Private Use Area-A and Area-B, respectively: Contain characters that are used internally by fonts for auxiliary glyphs and ligatures. |
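Because every plane holds exactly 65,536 code points, the plane of a character follows directly from its code point value. The following is a minimal Python sketch of that arithmetic. The helper name `plane_of` and the sample characters are illustrative assumptions, not part of any standard API.

```python
def plane_of(ch: str) -> int:
    """Return the plane number (0 through 16) that holds a single character."""
    # Each plane spans 65,536 (2^16) code points, so the plane number is
    # the code point with its low 16 bits shifted away.
    return ord(ch) >> 16


# Sample characters chosen for illustration only.
samples = {
    "b": "Latin small letter b",
    "\u0414": "Cyrillic capital letter de",
    "\U0001F600": "grinning face emoji",
    "\U00020000": "a CJK ideograph from the SIP",
}

planes = {ch: plane_of(ch) for ch in samples}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} ({description}) is in plane {planes[ch]}")
```

The first two samples land in plane 0 (the BMP), the emoji in plane 1, and the ideograph in plane 2, which matches the rows of Table 2.3.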
In simplest terms, Unicode provides a code point for every character or symbol in nearly all the world’s writing systems. Unicode code points are written in the form “U+####” in which “####” is made up of four to six hexadecimal digits. To give three quick examples, the code point U+0062 (decimal 98) represents a lowercase “b,” the same character and same value as in the Latin ASCII table. The Cyrillic capital letter “de” or “Д” has the Unicode code point of U+0414 (decimal 1044), and the fleur-de-lis symbol “⚜” is U+269C (decimal 9884).
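To make the “U+####” notation concrete, here is a short Python sketch that converts between characters and their code point labels using the built-in `ord` and `chr` functions. The helper names `to_u_notation` and `from_u_notation` are hypothetical and exist only for this illustration.

```python
def to_u_notation(ch: str) -> str:
    """Format a single character as U+#### with at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(label: str) -> str:
    """Turn a label such as 'U+0414' back into the character it names."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not a code point label: {label!r}")
    return chr(int(label[2:], 16))


# The three examples from the text: "b", Cyrillic "de", and the fleur-de-lis.
for ch in ["b", "\u0414", "\u269C"]:
    print(ch, to_u_notation(ch), "decimal", ord(ch))

# Characters outside the BMP simply need more than four hex digits.
print(to_u_notation("\U0001F600"))
```

The `04X` format pads to four digits but never truncates, so supplementary-plane characters naturally come out with five or six digits.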
Unicode was originally conceived as a 16-bit encoding, which provided room for 65,536 characters. This was considered enough space to encode all modern scripts around the world, with Private Use areas designated to hold rare or obsolete characters. Across the 17 planes of today's code space (1,114,112 code points in total), Unicode has approximately 10% of its available code points in use, leaving ample room to grow.
Combining Character Sequences
Certain characters can be represented either as a single code point or as a sequence of two or more code points. Take, for example, the “ì” character. This can be represented either as a single character “ì” (“Latin Small Letter I with Grave,” U+00EC), or as a combination of two characters, “i” (“Latin Small Letter I,” U+0069) and “◌̀” (“Combining Grave Accent,” U+0300). The first form is a precomposed (or composite) character, and the second is a combining character sequence. This combination of characters is not restricted to Latin scripts but includes CJK character sets as well. With Hangul, the syllable “가” can be represented as a single code point, U+AC00, or as the sequence “ᄀ” + “ᅡ,” U+1100 and U+1161.
As far as Unicode is concerned, the two forms are not identical, because they contain different code points, but they are canonically equivalent. In other words, they have the same appearance and meaning. See the “Diacritics” section later in this chapter for examples of this combination in action.
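Canonical equivalence can be observed with Python's standard `unicodedata` module. The sketch below assumes the usual approach of normalizing both strings to the same form (NFC for composed, NFD for decomposed) before comparing them. The variable names are illustrative.

```python
import unicodedata

precomposed = "\u00EC"   # Latin Small Letter I with Grave
decomposed = "i\u0300"   # Latin Small Letter I + Combining Grave Accent

# A plain comparison looks only at code points, so the two forms differ.
raw_equal = precomposed == decomposed

# NFC composes sequences where possible; NFD decomposes them.
composed_form = unicodedata.normalize("NFC", decomposed)
decomposed_form = unicodedata.normalize("NFD", precomposed)

# The Hangul syllable U+AC00 decomposes into the jamo U+1100 and U+1161.
hangul_jamo = unicodedata.normalize("NFD", "\uAC00")

print(raw_equal)                          # False
print(composed_form == precomposed)       # True
print(decomposed_form == decomposed)      # True
print([f"U+{ord(c):04X}" for c in hangul_jamo])
```

Normalizing both sides to one form before comparing, searching, or hashing is a common way to treat canonically equivalent strings as the same text.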
Duplicate Characters
As you look through existing Unicode tables, do you see double? The characters might look the same, but that’s the only similarity they have. The display might be identical, but some characters are encoded at different code points to retain the character’s meaning. Take, for example, the Latin character “A” (U+0041). Its shape is identical to the Cyrillic “A” (U+0410), but they are two very separate characters. Having the separate code points also simplifies the conversion from legacy encodings.
Of course, there is the contrary scenario in which there truly are duplicate characters, both in display and in the character’s meaning. An example here would be the Angstrom sign, “Å.” This character owns two listings; the first has the character info “Latin Capital Letter A with Ring Above” (U+00C5), and the second has “Angstrom Sign” (U+212B).
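The difference between look-alike characters and true duplicates can be checked with Python's `unicodedata` module. In this sketch (variable names are illustrative), the Latin and Cyrillic letters remain distinct under normalization, while the Angstrom sign folds into the Latin letter with ring above.

```python
import unicodedata

latin_a = "\u0041"       # Latin Capital Letter A
cyrillic_a = "\u0410"    # Cyrillic Capital Letter A
a_with_ring = "\u00C5"   # Latin Capital Letter A with Ring Above
angstrom = "\u212B"      # Angstrom Sign

# Look-alikes: same shape, different characters, never unified.
lookalikes_equal = (
    unicodedata.normalize("NFC", latin_a)
    == unicodedata.normalize("NFC", cyrillic_a)
)

# True duplicate: NFC maps the Angstrom sign to the Latin letter.
angstrom_folded = unicodedata.normalize("NFC", angstrom)

for ch in (latin_a, cyrillic_a, a_with_ring, angstrom):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print(lookalikes_equal)                 # False
print(angstrom_folded == a_with_ring)   # True
```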
Other characters that fall into this category have a property known as compatibility equivalence. Compatibility-equivalent characters represent essentially the same character but may have a slightly different visual appearance or different behavior. Concrete examples include Greek letters, which can serve as mathematical and technical symbols or as actual Greek text, and Roman numerals.
Other examples of compatibility equivalence include ligatures. The single character “ﬀ” (“Latin Small Ligature FF,” U+FB00) is compatible with the sequence of two characters “ff,” that is, “Latin Small Letter F” (U+0066) + “Latin Small Letter F” (U+0066). They might render and display similarly, but that similarity is dependent on context, typeface, and the text renderer.
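Compatibility equivalence is weaker than canonical equivalence, and the distinction shows up in the normalization forms. The following Python sketch, using the standard `unicodedata` module with illustrative variable names, shows that NFC leaves the ligature alone while NFKC replaces it with its compatibility sequence.

```python
import unicodedata

ligature = "\uFB00"    # Latin Small Ligature FF
micro = "\u00B5"       # Micro Sign, a Greek letter used as a technical symbol
roman_one = "\u2160"   # Roman Numeral One

# Canonical normalization does not touch compatibility characters.
nfc_ligature = unicodedata.normalize("NFC", ligature)

# Compatibility normalization folds them into their plain counterparts.
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_micro = unicodedata.normalize("NFKC", micro)
nfkc_roman = unicodedata.normalize("NFKC", roman_one)

# The "<compat>" tag marks a compatibility (not canonical) mapping.
print(unicodedata.decomposition(ligature))
print(nfc_ligature == ligature)     # True
print(nfkc_ligature == "ff")        # True
print(f"U+{ord(nfkc_micro):04X}", nfkc_roman)
```

Because NFKC discards distinctions that can carry meaning or formatting, it is usually better suited to search and comparison than to storing the original text.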