Encoding in Action
Wow. I’m thankful we have Unicode; although it’s imperfect, it is a far, far more direct approach to working with characters and handling encoding. Let’s move on to seeing encoding in action with some code samples and then wrap up this section with some encoding bloopers. We’ve already talked about ASCII encoding; now let’s look at two of the most common Unicode encoding formats, UTF-8 and UTF-16.
UTF-8
UTF-8 stands for Universal Character Set (UCS) Transformation Format—8-bit.
The UTF-8 format uses variable-width encoding and is capable of storing and representing every character in the Unicode character set. Its design avoids the endianness complications and byte order marks found in the UTF-16 and UTF-32 encoding formats and, even more important, preserves backward compatibility with the ASCII format (see the “Endianness” note later in the chapter for more detail about byte order). This encoding format accounts for more than half of all web pages, and the Internet Mail Consortium recommends that all email programs create and display messages using UTF-8. It’s increasingly becoming the default character encoding in software applications, operating systems, and programming languages. Xcode is a prime example of this, as shown in Figure 2.1.
Figure 2.1 The default encoding for source files in Xcode.
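You can see the variable-width behavior for yourself with a few lines of Swift. The following is just a sketch (the sample characters are arbitrary, and it assumes you’re running it in a playground or a command-line tool): it prints the UTF-8 bytes produced for characters of increasing “width.” Notice that the plain ASCII letter is still a single byte with the same value it has in ASCII, which is exactly the backward compatibility described above.

import Foundation

// Inspect the UTF-8 byte sequences for characters of different widths.
// ASCII stays one byte; other characters take two, three, or four bytes.
let samples = ["A", "é", "你", "🙂"]

for sample in samples {
    let bytes = Array(sample.utf8)   // the UTF-8 code units (bytes)
    let hex = bytes.map { String(format: "%02X", $0) }.joined(separator: " ")
    print("\(sample) -> \(bytes.count) byte(s): \(hex)")
}

// Prints:
// A -> 1 byte(s): 41            (identical to the ASCII value for 'A')
// é -> 2 byte(s): C3 A9
// 你 -> 3 byte(s): E4 BD A0
// 🙂 -> 4 byte(s): F0 9F 99 82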
UTF-16
UTF-16 is short for 16-bit Unicode Transformation Format. It is a variable-length encoding capable of representing every code point in the Unicode code space, which runs from 0 to 0x10FFFF, well over one million code points; each code point is encoded with one or two 16-bit code units. UTF-16 provides excellent support for Asian languages, because the most common characters in those scripts fit in a single 16-bit code unit. It is not, however, ASCII compatible.
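As a companion to the UTF-8 listing earlier, here’s a similar sketch (again in Swift, with arbitrary sample characters) that prints the UTF-16 code units for the same kinds of characters. A character inside the Basic Multilingual Plane needs just one 16-bit unit, while a code point above U+FFFF, such as many emoji, needs a surrogate pair; and even the ASCII letter now occupies a full 16-bit unit, which is why UTF-16 is not ASCII compatible.

import Foundation

// Inspect the UTF-16 code units for a few characters.
// BMP characters take one 16-bit unit; code points above U+FFFF take two.
let samples = ["A", "你", "🙂"]

for sample in samples {
    let units = Array(sample.utf16)  // the UTF-16 code units
    let hex = units.map { String(format: "%04X", $0) }.joined(separator: " ")
    print("\(sample) -> \(units.count) code unit(s): \(hex)")
}

// Prints:
// A -> 1 code unit(s): 0041        (a full 16-bit unit, so not ASCII compatible)
// 你 -> 1 code unit(s): 4F60
// 🙂 -> 2 code unit(s): D83D DE42  (a surrogate pair for U+1F642)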