Code Pages and Encoding
At its most basic, character encoding is the assignment of a numeric code to a character. That number is called a code point. The OS represents each assigned code point with one or more bytes. An encoding is therefore a set of mappings between the bytes the OS stores and the characters in the coded character set, which gives the OS a way to reference every available character. If data is read with a different encoding than the one used to write it, the same bytes may map to different characters, and the resulting data looks like garbage to the customer. To add to the complexity, there are many character sets and character encodings, giving us many ways to map among bytes, code points, and characters. Code samples in upcoming sections demonstrate this in action.
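As a minimal sketch of the distinction among characters, code points, and bytes, the following Python snippet takes one character and shows its code point and the bytes that three encodings produce for it. The choice of Python and of the codec names cp437, latin-1, and utf-8 is an assumption made for illustration only.

```python
# One character, one code point, several possible byte representations.
char = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(char)  # the numeric code assigned to the character
print("code point:", code_point)

# The same character encoded under three encodings (illustrative choices).
encoded = {name: char.encode(name) for name in ("cp437", "latin-1", "utf-8")}
for name, raw in encoded.items():
    print(f"{name:8} -> {raw.hex()}")

# Reading bytes back with the wrong mapping yields a different character.
misread = encoded["latin-1"].decode("cp437")
print(ascii(char), "was misread as", ascii(misread))
```

The byte values differ from one encoding to the next even though the character and its code point stay the same, which is why the reader of the data has to know which mapping the writer used.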
But where did this complexity originate? If we’re talking about characters—which are typically stored in one or two bytes—and numeric codes assigned to these characters, then why is there not a one-to-one correspondence? Let’s all sit back, relax, and enjoy a small history lesson on encoding.
ASCII Character Set
Back when the IBM-PC was first introduced (the Stone Age in computer time), localization was a low priority, and the characters that mattered most were numbers, punctuation symbols, and unaccented English letters. Each of them had a code assigned by ASCII (American Standard Code for Information Interchange), which represented every printable character using a numeric value from 32 to 126. The capital letter “D” has a code point of 68 (decimal value), a lowercase “m” has a code point of 109, and an exclamation point (“!”) has a code point of 33. All ASCII code points fit conveniently in seven bits. Most systems at this time used bytes of eight bits in length, so every possible ASCII character could be stored with a bit to spare. Code points below 32, along with 127 (“delete”), were labeled unprintable and were used for control characters, such as 10, which is a “line feed,” and 13, which is a “carriage return.”
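The ASCII values are easy to check for yourself. The short Python sketch below (Python is an assumption here, chosen only for illustration) prints the code points of a few characters and confirms that ASCII text never needs the eighth bit.

```python
# Code points of a few ASCII characters.
samples = {ch: ord(ch) for ch in ("D", "M", "m", "!")}
for ch, code in samples.items():
    print(ch, code)

# Two of the control characters that sit below 32.
line_feed = ord("\n")
carriage_return = ord("\r")
print(line_feed, carriage_return)

# Every ASCII code point fits in seven bits, so the top bit of each
# stored byte is always zero.
ascii_text = "DMm!\r\n"
fits_in_seven_bits = all(ord(ch) < 2**7 for ch in ascii_text)
top_bit_clear = all(byte & 0x80 == 0 for byte in ascii_text.encode("ascii"))
print(fits_in_seven_bits, top_bit_clear)
```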
Extended Character Set
Noticing that a byte has room for eight bits, people collectively got the idea, “Hey, codes 128 through 255 are available for our own aspirations.” One result was the IBM-PC’s original equipment manufacturer (OEM) character set, which provided some support for European languages through a handful of accented characters, along with line-drawing characters such as horizontal and vertical bars, and other symbols.
Once computers were being purchased outside of the U.S., all manner of different OEM character sets appeared, each using the spare 128 codes for its own designs. As a result, a character encoded under one character set could appear as a completely different character on a computer using another extended character set. For example, on some computers, the character code 130 would display as “é,” but on computers sold in Israel the same code would display as the Hebrew letter gimel (ג). When Americans sent their “résumés” to Israel, they would arrive as “rגsumגs.” In many cases, such as with Russian, there were several divergent ideas about how to use the upper 128 codes, so Russian documents could not be reliably interchanged.
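The code 130 mix-up can be reproduced in a few lines. This sketch assumes Python and uses its cp437 and cp862 codecs as stand-ins for the original IBM-PC OEM character set and the Hebrew OEM character set; those codec names are my choice for illustration.

```python
# Code 130 (0x82) interpreted under two OEM code pages.
raw = bytes([130])
us_view = raw.decode("cp437")      # stand-in for the original IBM-PC OEM set
hebrew_view = raw.decode("cp862")  # stand-in for the Hebrew OEM set
print(ascii(us_view), ascii(hebrew_view))

# A whole word written on one machine and read on the other.
sent = "r\u00e9sum\u00e9s".encode("cp437")  # the word with both accented e's
received = sent.decode("cp862")
print(sent.hex(), "->", ascii(received))
```

The bytes never change in transit; only the table used to interpret them does.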
ANSI Standard
Eventually, this OEM free-for-all got codified in the ANSI (American National Standards Institute) standard. With the ANSI standard, the consensus was to handle the characters below 128 the same as ASCII, while the handling of characters from 128 and up would depend on the locale.
The different ideas for these upper values were codified into what are known as code pages. A code page is a table of values that describes a language’s encoding for a particular character set. Each code page had a number associated with it. Greek speakers would use code page 737, Cyrillic speakers code page 855, and so on. All of these code pages were identical for codes 0 to 127, but differed for codes 128 and up.
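To see how code pages relate to one another, the sketch below decodes the same byte values under several code pages and checks where they agree. It assumes Python and its built-in cp437, cp737, cp855, and cp862 codecs; the helper names are hypothetical.

```python
# Code pages to compare (Python codec names, chosen for illustration).
CODE_PAGES = ("cp437", "cp737", "cp855", "cp862")

# The ASCII range: every code page should decode it identically.
lower = bytes(range(128))
ascii_view = lower.decode("ascii")
same_in_ascii_range = all(lower.decode(cp) == ascii_view for cp in CODE_PAGES)
print("agree on ASCII range:", same_in_ascii_range)

# One code from the upper half: each code page is free to differ.
upper_views = {cp: bytes([130]).decode(cp) for cp in CODE_PAGES}
for cp, ch in upper_views.items():
    print(cp, ascii(ch))
differ_in_upper_range = len(set(upper_views.values())) > 1
print("differ on code 130:", differ_in_upper_range)
```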
Asian Character Support: DBCS
This subject becomes even more complex when we’re dealing with Asian character sets. Because the Chinese, Japanese, and Korean languages each contain far more than 256 characters, a different scheme needed to be developed, one that could work around the limit of a code page holding only 256 characters. The result of this was the double-byte character set (DBCS).
Each Asian character is represented by a pair of bytes (hence the term double-byte), which in theory allows for representing up to 65,536 characters. For programming awareness, a range of byte values is set aside to serve as the first (lead) byte of a pair, and a lead byte is not valid unless it is immediately followed by a defined second byte. DBCS meant that you had to write code that would treat each such pair of bytes as one character, and it still did not allow combining, say, Japanese and Chinese in the same data stream, because depending on the code page, the same double-byte values represent different characters in the different languages.
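The lead-byte behavior and the cross-language ambiguity can both be seen in a small sketch. It assumes Python, with its shift_jis codec standing in for a Japanese double-byte code page and its gbk codec for a Chinese one; both codec choices are mine, made for illustration.

```python
hiragana_a = "\u3042"  # HIRAGANA LETTER A

# One character becomes a lead byte followed by a second byte.
sjis = hiragana_a.encode("shift_jis")
print(len(hiragana_a), "character ->", len(sjis), "bytes:", sjis.hex())

# Byte count and character count diverge once double-byte characters appear.
mixed_text = "A" + hiragana_a
mixed_bytes = mixed_text.encode("shift_jis")
print(len(mixed_text), "characters,", len(mixed_bytes), "bytes")

# A lead byte with no second byte after it is not valid on its own.
try:
    sjis[:1].decode("shift_jis")
    lone_lead_byte_valid = True
except UnicodeDecodeError:
    lone_lead_byte_valid = False
print("lone lead byte valid:", lone_lead_byte_valid)

# The same two bytes read under a Chinese code page give another character.
as_gbk = sjis.decode("gbk")
print(ascii(hiragana_a), "vs", ascii(as_gbk))
```

Because nothing in the byte pair itself records which code page produced it, a stream that mixes the two languages cannot be decoded correctly with either table alone.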