Surrogate Characters
Surrogate characters are typically referred to as surrogate pairs. A surrogate pair is a combination of two 16-bit code units that together represent a single code point. To make the detection of surrogate pairs easy, the Unicode standard has reserved the range from U+D800 to U+DFFF for the use of UTF-16. No characters are assigned to code point values in this range. When programs see a 16-bit code unit that falls in this range, they immediately (zip! zip!) know that they have encountered part of a surrogate pair.
This reserved range is composed of two parts:
- High surrogates—U+D800 to U+DBFF (total of 1,024 code points)
- Low surrogates—U+DC00 to U+DFFF (total of 1,024 code points)
A lone surrogate is invalid in UTF-16; surrogates are always written in pairs, with the high surrogate followed by the low.
With UTF-16 encoding, characters with code points in the ranges U+0000 through U+D7FF and U+E000 through U+FFFD are stored as single 16-bit units. Code points from U+10000 through U+10FFFF are stored as surrogate pairs.
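The range checks described above can be sketched in a few lines of plain C, which also compiles unchanged inside an Objective-C file. This is only an illustration of the rules, not code from any Apple framework; the function names is_high_surrogate, is_low_surrogate, and utf16_is_well_formed are made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* High surrogates occupy U+D800 through U+DBFF. */
static bool is_high_surrogate(uint16_t unit) {
    return unit >= 0xD800 && unit <= 0xDBFF;
}

/* Low surrogates occupy U+DC00 through U+DFFF. */
static bool is_low_surrogate(uint16_t unit) {
    return unit >= 0xDC00 && unit <= 0xDFFF;
}

/* Hypothetical helper: returns true when every high surrogate is
   immediately followed by a low surrogate and no low surrogate
   appears on its own. */
static bool utf16_is_well_formed(const uint16_t *units, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (is_high_surrogate(units[i])) {
            if (i + 1 >= count || !is_low_surrogate(units[i + 1])) {
                return false;   /* lone high surrogate */
            }
            i++;                /* skip the low half of the pair */
        } else if (is_low_surrogate(units[i])) {
            return false;       /* lone low surrogate */
        }
    }
    return true;
}
```

Because the two halves live in separate sub-ranges, a single comparison is enough to tell whether a code unit starts a pair, ends a pair, or stands alone.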
Table 2.6 contains examples of surrogate pairs.
Table 2.6 Examples of Surrogate Pairs
| Character | Code Point | Surrogate Pair |
|  | U+10000 | {U+D800, U+DC00} |
|  | U+10E6D | {U+D803, U+DE6D} |
|  | U+1D11E | {U+D834, U+DD1E} |
|  | U+10FFFF | {U+DBFF, U+DFFF} |
The following code snippet shows you how to get a printout of a surrogate pair when you are given its code point:
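UniChar characterArray[2];
// Split the UTF-32 value into a high and a low UTF-16 surrogate.
CFStringGetSurrogatePairForLongCharacter(0x10FFFF, characterArray);
// Build a two-unit NSString from the pair.
NSString *surrogate = [[NSString alloc] initWithCharacters:characterArray length:2];
NSLog(@"Surrogate: %@", surrogate);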
Note that this is taking advantage of the CFStringGetSurrogatePairForLongCharacter function, which maps a UTF-32 character to a pair of UTF-16 surrogate characters. We need an array to plug the resulting UTF-16 pair into—that’s what the characterArray is for—and then the initWithCharacters:length: method of NSString does the rest.
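If you are curious about the arithmetic behind such a mapping, here is a minimal sketch in plain C based on the UTF-16 definition itself. It is not Apple's implementation, and the names to_surrogate_pair and from_surrogate_pair are hypothetical. The idea is to subtract 0x10000 from the code point, which leaves a 20-bit value, and then place the upper 10 bits in the high surrogate and the lower 10 bits in the low surrogate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: splits a code point in U+10000..U+10FFFF into a
   surrogate pair. Returns false when the code point does not need one
   or is out of range. */
static bool to_surrogate_pair(uint32_t code_point, uint16_t pair[2]) {
    if (code_point < 0x10000 || code_point > 0x10FFFF) {
        return false;
    }
    uint32_t offset = code_point - 0x10000;            /* 20-bit value */
    pair[0] = (uint16_t)(0xD800 + (offset >> 10));     /* upper 10 bits */
    pair[1] = (uint16_t)(0xDC00 + (offset & 0x3FF));   /* lower 10 bits */
    return true;
}

/* Hypothetical helper: the reverse mapping. The caller is assumed to
   pass a valid high surrogate followed by a valid low surrogate. */
static uint32_t from_surrogate_pair(uint16_t high, uint16_t low) {
    return 0x10000
         + ((uint32_t)(high - 0xD800) << 10)
         + (uint32_t)(low - 0xDC00);
}
```

Feeding the code points from Table 2.6 through to_surrogate_pair is a quick way to check the table by hand. For U+1D11E the offset is 0xD11E, whose upper 10 bits are 0x34 and lower 10 bits are 0x11E, giving {U+D834, U+DD1E}.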
Emoji
I’m making a special callout on the Emoji characters because they are extremely popular, and Apple both uses a special font to represent them and provides a keyboard to input just Emoji characters.
Emoji is the Japanese term for picture characters. They were introduced in the late 1990s by a Japanese mobile phone provider. Created by Shigetaka Kurita as an effort to retain his company's customer base, the smiley-faced icons gave text messages more cuteness. The other supporting factor of the Emoji characters was the ability to give contextual information with a single character. What’s the weather going to be like today? That’s easily presented with a sun or umbrella or cloud Emoji character.
Figure 2.2 shows the first page of the Emoji keyboard.
Figure 2.2 The Emoji keyboard.
Apple Color Emoji is a font available on both iOS and OS X to provide support for the Unicode Emoji characters. Instead of this font having glyphs with black and white outlines, it has full-color, higher-resolution images for each of the nearly 900 glyphs it supports.
Strong support of Emoji has been a hard target to hit because it has historically occupied a private use area of Unicode. The standardized Emoji code points, such as U+1F539 and U+1F604, lie above U+FFFF, so in UTF-16 each one is stored as a surrogate pair.
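One practical consequence is that a code point such as U+1F604 lies above U+FFFF, so it takes two 16-bit units in UTF-16. Counting units and counting characters then give different answers. The following plain C sketch counts code points in a UTF-16 buffer; utf16_code_point_count is a hypothetical name, and the function assumes nothing about where the buffer came from.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in a UTF-16 buffer. A high
   surrogate followed by a low surrogate counts as one code point; any
   other unit, including a lone surrogate, counts as one on its own. */
static size_t utf16_code_point_count(const uint16_t *units, size_t count) {
    size_t code_points = 0;
    for (size_t i = 0; i < count; i++) {
        int starts_pair = units[i] >= 0xD800 && units[i] <= 0xDBFF
                       && i + 1 < count
                       && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;
        if (starts_pair) {
            i++;    /* consume the low surrogate as part of this code point */
        }
        code_points++;
    }
    return code_points;
}
```

For the four units {0x0048, 0x0069, 0xD83D, 0xDE04}, which spell "Hi" followed by U+1F604, the function reports three code points even though the buffer holds four units.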