iOS Internationalization: Characters and Encoding
- Reading asks that you bring your whole life experience and your ability to decode the written word and your creative imagination to the page and be a co-author with the writer, because the story is just squiggles on the page unless you have a reader.
- Katherine Paterson
Characters. Letters. Symbols. Items used on a printed page and on a multitude of displays we use today. When used in an organized sequence, they are interpreted to give meaning or, in other words, create words. Languages use different characters with accent marks and pronunciation marks to accentuate or provide meaning. We’ll talk about what’s involved in creating characters, things like diacritics and surrogate pairs and ligatures, and storing those characters (encoding and code points).
Chapter topics include the following:
- What’s behind the scenes with characters
- How characters are stored and accessed by the OS
- How the OS determines what character to use based on its language setting
- How glyphs allow us to have different renditions or renderings of the same character
- What causes “garbage” characters or empty box characters to display
We’ll hit the essentials about characters, strings, encoding, Unicode, and glyphs, and wrap up with fonts.
Characters
What is a character, what constitutes a character, and how is that character represented as far as ye ole computer is concerned? A character is the smallest component of a written language that has semantic value. Focusing on the English (U.S.) alphabet, it’s composed of 26 characters, and depending on the order and combination of those 26 characters, words can be formed, returning even more meaning. This section discusses characters in generic terms to get you into the mind-set of thinking of individual characters. The other goal I have is to make you aware of the different “characteristics” of characters. How characters are handled at the operating system level is covered in the “Code Pages and Encoding” section.
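To make "thinking in individual characters" concrete, here is a minimal Swift sketch. It assumes only the standard library's `String` views; the helper `counts(of:)` and the sample strings are hypothetical names invented for this illustration. It shows that one user-perceived character can be built from more than one stored value.

```swift
// Count the same text three ways: user-perceived characters,
// Unicode scalars (code points), and UTF-16 code units.
func counts(of text: String) -> (characters: Int, scalars: Int, utf16Units: Int) {
    return (text.count, text.unicodeScalars.count, text.utf16.count)
}

let plain = "cat"
let accented = "e\u{0301}"   // "e" plus U+0301 COMBINING ACUTE ACCENT
let emoji = "\u{1F600}"      // a scalar above U+FFFF, stored in UTF-16 as a surrogate pair

for sample in [plain, accented, emoji] {
    let c = counts(of: sample)
    print("\(sample): \(c.characters) character(s), \(c.scalars) scalar(s), \(c.utf16Units) UTF-16 unit(s)")
}
```

The accented sample counts as one character but two scalars, and the emoji counts as one character but two UTF-16 code units.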
Types of Characters
You will be working with more than the characters from the English (U.S.) alphabet, so let’s talk about characters that exist in other languages.
Accented Characters
Often, accents on characters such as the acute (´) accent and the grave (`) accent are referred to as diacritical marks. Other accents from European languages include the circumflex (^), umlaut (¨), and cedilla (¸).
The main use of accents is to change the accented character’s sound value. English examples include naïve and Noël. The accented characters (diereses in this case) show that these characters (vowels) are pronounced separately from the preceding vowel.
Acute and grave accents can indicate that a final vowel is to be pronounced, as in the French words résumé and été.
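An accented letter such as the é in résumé can be stored either as one precomposed code point or as a base letter followed by a combining accent. The sketch below is illustrative only; the constant names are hypothetical. It spells the word both ways and relies on Swift's `String` comparison, which uses canonical equivalence.

```swift
let precomposed = "r\u{00E9}sum\u{00E9}"    // each é is the single scalar U+00E9
let decomposed = "re\u{0301}sume\u{0301}"   // each é is "e" followed by U+0301

// Swift compares strings by canonical equivalence, so the two spellings are equal.
print(precomposed == decomposed)                                           // true
// Both contain six user-perceived characters...
print(precomposed.count, decomposed.count)                                 // 6 6
// ...but a different number of Unicode scalars.
print(precomposed.unicodeScalars.count, decomposed.unicodeScalars.count)   // 6 8
```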
Accents serve other purposes in other writing systems. The Arabic harakat and the Hebrew niqqud systems are used to indicate vowel and tone sounds that are not conveyed through the basic alphabet. The Arabic sukūn and the Indic virama both designate the absence of a vowel. Special marks exist to flag abbreviations or acronyms, as in the Cyrillic titlo and the Hebrew gershayim. Greek uses a mark to indicate that letters of the alphabet are being used as numerals. In the Chinese Hanyu Pinyin system, accents on vowels mark the tone of the syllable in which the marked vowel occurs.
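Many of the marks described above are stored as separate combining code points that follow a base letter. As a rough sketch, assuming the Unicode general category is a good enough test for "this is a mark," the following Swift lists the marks found in a string. The helpers `hex(_:)` and `combiningMarks(in:)` are hypothetical names for this example, and the sample code points were picked for illustration.

```swift
// Format a scalar as U+XXXX, padding to at least four hex digits.
func hex(_ scalar: Unicode.Scalar) -> String {
    let digits = String(scalar.value, radix: 16, uppercase: true)
    return "U+" + String(repeating: "0", count: max(0, 4 - digits.count)) + digits
}

// Return the scalars in `text` whose general category is one of the mark categories.
func combiningMarks(in text: String) -> [String] {
    return text.unicodeScalars.filter { scalar in
        switch scalar.properties.generalCategory {
        case .nonspacingMark, .spacingMark, .enclosingMark:
            return true
        default:
            return false
        }
    }.map(hex)
}

print(combiningMarks(in: "e\u{0301}"))          // Latin e + acute accent
print(combiningMarks(in: "\u{0628}\u{064E}"))   // Arabic beh + fatha (a haraka)
print(combiningMarks(in: "\u{05D1}\u{05B8}"))   // Hebrew bet + qamats (niqqud)
print(combiningMarks(in: "\u{0915}\u{094D}"))   // Devanagari ka + virama
print(combiningMarks(in: "ma\u{030C}"))         // Pinyin third tone written with a combining caron
```

A precomposed letter such as U+00E9 would not be reported by this check, because the accent is not stored as a separate scalar in that form.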
Chinese Characters
Chinese characters in themselves do not make up an alphabet. The writing system for Chinese is logosyllabic, meaning that a character generally represents one syllable of spoken Chinese and might be a word on its own or a part of a polysyllabic word. Chinese characters are all derived from several hundred simple pictographs (representing physical objects) and ideographs (representing abstract notions).
Some Chinese characters have been adopted as part of the writing systems of other East Asian languages, such as Japanese and Korean. International software support for the Chinese, Japanese, and Korean languages is often abbreviated as CJK. Table 2.1 shows examples of Asian characters.
Table 2.1 Sampling of Asian Characters
Language | Character | English Translation
Chinese | | “tree”
Japanese | | “fish”
Korean | | “book”
Characters for the Japanese language are usually a mixture of Chinese characters, or kanji, plus two syllabic scripts, hiragana and katakana. At times the English (U.S.) alphabet is used as well. Having a working knowledge of 2,000 kanji characters is sufficient to read and comprehend most Japanese text.
Korean characters come mainly from an alphabetic script, Hangeul. Some hanja Chinese characters are used, but to a much lesser extent than with Japanese. When reading older Korean texts, an understanding of about 2,000 hanja characters is essential.
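Because these scripts occupy distinct ranges of Unicode code points, a program can make a rough guess at the script of a character from its value. The sketch below is a simplification rather than a complete classifier: the type `EastAsianScript` and the function `script(of:)` are hypothetical, and only the main blocks are checked (CJK Unified Ideographs, Hiragana, Katakana, and Hangul Syllables), so extension blocks and compatibility characters fall through to `.other`.

```swift
enum EastAsianScript {
    case han, hiragana, katakana, hangul, other
}

// Classify a scalar by the main Unicode block it falls in.
func script(of scalar: Unicode.Scalar) -> EastAsianScript {
    switch scalar.value {
    case 0x4E00...0x9FFF: return .han        // CJK Unified Ideographs (kanji, hanja, hanzi)
    case 0x3040...0x309F: return .hiragana   // Hiragana block
    case 0x30A0...0x30FF: return .katakana   // Katakana block
    case 0xAC00...0xD7AF: return .hangul     // Hangul Syllables block
    default: return .other
    }
}

// U+9B5A (a kanji), U+306E (hiragana), U+30C6 (katakana), U+CC45 (a Hangul syllable), and "A".
let sampleText = "\u{9B5A}\u{306E}\u{30C6}\u{CC45}A"
for scalar in sampleText.unicodeScalars {
    print(scalar, script(of: scalar))
}
```

Note that a Han character on its own does not reveal whether the text is Chinese, Japanese, or Korean; the block is shared by all three.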
Table 2.2 contains different types of characters that are not necessarily found in any language’s alphabet but are interesting nonetheless. We’ll go over these types of characters and how to use and access them in the section, “Unicode and Encoding.” Punctuation characters have the capability to accentuate meaning, context, and understanding of text, but they do need to be associated with characters and words to accomplish that task. They cannot add meaning or understanding by themselves.
Table 2.2 Sampling of “Other” Characters
Category | Character | English Description
Punctuation | ¶ | Paragraph symbol, or pilcrow sign
Pictographs | ☃ | Snowman
Math Symbols | ∩ | Intersection
Letterlike Symbols | ℃ | Degree Celsius
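Each of the characters in Table 2.2 has its own Unicode code point and official name. As a small sketch, the Swift below looks them up through the standard library's `Unicode.Scalar.Properties`. The function `describe(_:)` and the array `samples` are hypothetical names for this example, and the code points are the ones assumed to match the four table entries.

```swift
// Build a "U+XXXX NAME" description for a single Unicode scalar.
func describe(_ scalar: Unicode.Scalar) -> String {
    let digits = String(scalar.value, radix: 16, uppercase: true)
    let padded = String(repeating: "0", count: max(0, 4 - digits.count)) + digits
    let name = scalar.properties.name ?? "(no name)"
    return "U+" + padded + " " + name
}

// Pilcrow sign, snowman, intersection, and degree Celsius.
let samples: [Unicode.Scalar] = ["\u{00B6}", "\u{2603}", "\u{2229}", "\u{2103}"]
for scalar in samples {
    print(scalar, describe(scalar))
}
```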