Diacritics
A diacritic, or diacritical mark, is a mark, point, or sign attached to a character to distinguish it from another of similar form. The mark can also give the character a particular phonetic value or indicate stress. A cedilla (a hook or tail, “¸”) does this when it is added under certain letters to modify their pronunciation, as in “ç ş Ȩ.” (The similar-looking hook in “ą Ų” is a separate diacritic, the ogonek.) Other diacritics that affect a character’s pronunciation include the tilde “~,” the circumflex (the chevron-shaped “^” in the Latin script), and the macron when it is placed above a vowel, as in “ē Ā.” You will also hear diacritics referred to as combining characters.
Diacritics can be treated in several ways:
- A letter with a diacritic can be encoded as a Roman base character plus a combining diacritic character. A combination such as “á” might be encoded either as a single character named “Latin Small Letter A with Acute” (U+00E1), or as a sequence of two characters, “Latin Small Letter A” (U+0061) + “Combining Acute Accent” (U+0301). Another example is the character “Ä,” which can be represented as the single Unicode code point U+00C4 or as a pair of code points, U+0041 and U+0308.
- The “Latin Small Letter E with Grave,” or “è,” character might be rendered either as a single composite glyph or as two separate glyphs: one for “e” and another for an overstriking grave accent.
- In some orthographies (the conventions a language uses to map sounds to written letters), the combination “è” is considered a single grapheme (a letter or group of letters that represents a phoneme, the speech sound that distinguishes one word from another), whereas in other orthographies “e” and the grave accent are each considered a grapheme. In other words, if an orthography treats “è” as a grapheme, it should be encoded as a single character and rendered with a single glyph. Conversely, if an orthography has separate graphemes for “e” and “`” (the grave accent), they should be encoded as separate characters and rendered as separate glyphs.
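The difference between the two encodings of “Ä” can be observed from code. The following is a minimal sketch rather than a definitive recipe; FirstLetterCodeUnitCount is a hypothetical helper introduced only for this illustration. It relies on the fact that NSString's length counts UTF-16 code units, while rangeOfComposedCharacterSequenceAtIndex: reports the range covered by a base character and its combining marks.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): returns how many UTF-16 code
// units make up the first user-visible letter of a non-empty string.
static NSUInteger FirstLetterCodeUnitCount(NSString *text) {
    return [text rangeOfComposedCharacterSequenceAtIndex:0].length;
}

int main(void) {
    @autoreleasepool {
        // The same visible letter, encoded two ways. Escapes are used so the
        // result does not depend on how this source file is saved.
        NSString *single = @"\u00C4";  // LATIN CAPITAL LETTER A WITH DIAERESIS
        NSString *pair = @"A\u0308";   // U+0041 followed by COMBINING DIAERESIS

        // -length counts UTF-16 code units, not visible letters.
        NSLog(@"single: length %lu", (unsigned long)single.length);  // 1
        NSLog(@"pair:   length %lu", (unsigned long)pair.length);    // 2

        // Both strings still contain exactly one composed character sequence.
        NSLog(@"single: first letter spans %lu code unit(s)",
              (unsigned long)FirstLetterCodeUnitCount(single));      // 1
        NSLog(@"pair:   first letter spans %lu code unit(s)",
              (unsigned long)FirstLetterCodeUnitCount(pair));        // 2
    }
    return 0;
}
```

Both strings display as one letter, yet they differ in length, which is why code that indexes or compares strings by code unit can behave differently depending on the encoding it receives.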
Precomposed Diacritics
We’ve been talking about diacritics and how they can be added to or combined with other characters. Some of the terms given to this combination are composite character, decomposable character, and precomposed character. Let’s look at an example to see why this distinction is important. The character “ñ” is a precomposed character because it is treated as an individual Unicode character and has its own Unicode code point, U+00F1. Technically, this character can be decomposed into an equivalent string consisting of a base character “n” (U+006E) and a combining tilde “~” (U+0303). Precomposed characters are a solution for legacy support of special characters in character sets. They are included primarily to aid systems with incomplete Unicode support, in which the equivalent decomposed sequences may not be rendered correctly.
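The decomposition described above can be checked directly. This is a minimal sketch: CodeUnitList is a hypothetical helper written for this illustration, while decomposedStringWithCanonicalMapping and precomposedStringWithCanonicalMapping are the NSString methods that produce the canonically decomposed and composed forms of a string.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: lists the UTF-16 code units of a string as "U+XXXX".
// For characters in the Basic Multilingual Plane (all the ones used here),
// each code unit is also a complete code point.
static NSString *CodeUnitList(NSString *text) {
    NSMutableArray<NSString *> *parts = [NSMutableArray array];
    for (NSUInteger i = 0; i < text.length; i++) {
        unichar unit = [text characterAtIndex:i];
        [parts addObject:[NSString stringWithFormat:@"U+%04X", (unsigned int)unit]];
    }
    return [parts componentsJoinedByString:@" "];
}

int main(void) {
    @autoreleasepool {
        NSString *precomposed = @"\u00F1";  // LATIN SMALL LETTER N WITH TILDE
        NSString *decomposed = [precomposed decomposedStringWithCanonicalMapping];
        NSString *recomposed = [decomposed precomposedStringWithCanonicalMapping];

        NSLog(@"precomposed: %@", CodeUnitList(precomposed));  // U+00F1
        NSLog(@"decomposed:  %@", CodeUnitList(decomposed));   // U+006E U+0303
        NSLog(@"recomposed:  %@", CodeUnitList(recomposed));   // U+00F1
    }
    return 0;
}
```

Printing code units rather than the strings themselves is deliberate, because both forms look the same when displayed.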
Returning to our “Latin Small Letter N with Tilde” example, we could be dealing with either one single character or two separate characters. If our code does any kind of character or string comparison, a test can fail even though the two strings look identical on screen. To ensure that both sides of the comparison use the same representation, Unicode normalization is required. This can be accomplished via the NSString method precomposedStringWithCanonicalMapping, which returns the string in Unicode Normalization Form C (the composed form). Let’s look at some code. In Listing 2.5, we’ll work with our “Latin Small Letter N with Tilde” as a combined character sequence and compare it to the precomposed character.
Listing 2.5 Displaying Combined and Precomposed Characters
// "n" followed by U+0303 COMBINING TILDE
NSString *combinedCharacter = @"n\u0303";
// U+00F1, the single precomposed character; the escape avoids depending on how the source file is saved
NSString *precomposedCharacter = @"\u00F1";
BOOL isEqual = [combinedCharacter isEqualToString:precomposedCharacter];
NSLog(@"The 'combined' character, '%@', is %@ to 'precomposed' character, '%@'",
      combinedCharacter, isEqual ? @"equal" : @"not equal", precomposedCharacter);
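This logs “The 'combined' character, 'ñ', is not equal to 'precomposed' character, 'ñ'.” The two strings look identical, but isEqualToString: compares their underlying code units, and those differ.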
Now we apply the same test, but normalize both strings first:
// Normalize both strings to the precomposed (Form C) representation before comparing
NSString *combinedNormalized = [combinedCharacter precomposedStringWithCanonicalMapping];
NSString *precomposedNormalized = [precomposedCharacter precomposedStringWithCanonicalMapping];
BOOL isEqualNorm = [combinedNormalized isEqualToString:precomposedNormalized];
// Log the normalized strings, since those are the ones being compared
NSLog(@"The 'combined-normalized' character, '%@', is %@ to 'precomposed-normalized' character, '%@'",
      combinedNormalized, isEqualNorm ? @"equal" : @"not equal", precomposedNormalized);
This logs “The 'combined-normalized' character, 'ñ', is equal to 'precomposed-normalized' character, 'ñ'.”
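If this comparison is needed in more than one place, the normalize-then-compare step can be wrapped in a small function. The sketch below is one possible way to do that, not an established API: AreCanonicallyEquivalent is a hypothetical name, and it assumes both arguments are non-nil. It applies the same precomposedStringWithCanonicalMapping call used above to whole words instead of single characters.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when two strings are equal after both have been
// normalized to the precomposed (Form C) representation.
// Assumes both arguments are non-nil.
static BOOL AreCanonicallyEquivalent(NSString *first, NSString *second) {
    NSString *firstNormalized = [first precomposedStringWithCanonicalMapping];
    NSString *secondNormalized = [second precomposedStringWithCanonicalMapping];
    return [firstNormalized isEqualToString:secondNormalized];
}

int main(void) {
    @autoreleasepool {
        NSString *combinedWord = @"man\u0303ana";    // "n" + COMBINING TILDE
        NSString *precomposedWord = @"ma\u00F1ana";  // precomposed U+00F1

        // A plain isEqualToString: sees different code units.
        NSLog(@"plain comparison:      %@",
              [combinedWord isEqualToString:precomposedWord] ? @"equal" : @"not equal");

        // After normalization the two spellings match.
        NSLog(@"normalized comparison: %@",
              AreCanonicallyEquivalent(combinedWord, precomposedWord) ? @"equal" : @"not equal");

        // Genuinely different text is still reported as different.
        NSLog(@"different words:       %@",
              AreCanonicallyEquivalent(@"manana", precomposedWord) ? @"equal" : @"not equal");
    }
    return 0;
}
```

The first line logs “not equal,” the second “equal,” and the third “not equal,” since normalization only unifies different encodings of the same text and does not ignore the diacritic itself.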