Unicode
The authoritative, technical guide to the creation of software for worldwide use.
Detailed specifications for Unicode:
Expanded implementation guidelines by experts in global software design:
Comprehensive charts, references, glossary, and indexes:
CD-ROM
The comprehensive Unicode Character Database for:
International, national, and vendor character mappings for:
Unicode Technical Reports that extend the standard for:
Acknowledgments.
Unicode Consortium Members and Directors.
Full Members.
Current Associate Members.
Current Liaison Members.
Current Specialist Members.
Current Individual Members.
Current Members of the Board of Directors.
About the Unicode Standard.
Concepts, Architecture, Conformance, and Guidelines.
Character Block Descriptions.
Charts and Index.
Appendices and Tables.
The Unicode Character Database and Technical Reports.
On the CD-ROM.
Notational Conventions.
Extended BNF.
Operators.
Resources.
Unicode Website.
Unicode Anonymous FTP Site.
Unicode Public Mailing List.
How to Contact the Unicode Consortium.
Coverage.
Standards Coverage.
New Characters.
Design Basis.
Text Handling.
Interpreting Characters.
Text Elements.
The Unicode Standard and ISO/IEC 10646.
The Unicode Consortium.
The Unicode Technical Committee.
Architectural Context.
Basic Text Processes.
Text Elements, Code Values, and Text Processes.
Text Processes and Encoding.
Unicode Design Principles.
Sixteen-Bit Character Codes.
Efficiency.
Characters, Not Glyphs.
Semantics.
Plain Text.
Logical Order.
Unification.
Dynamic Composition.
Equivalent Sequence.
Convertibility.
Encoding Forms.
UTF-16.
UTF-8.
Character Encoding Schemes.
Unicode Allocation.
Allocation Areas.
Codespace Assignment for Graphic Characters.
Nongraphic Characters, Reserved and Unassigned Codes.
Writing Direction.
Combining Characters.
Sequence of Base Characters and Diacritics.
Multiple Combining Characters.
Multiple Base Characters.
Spacing Clones of European Diacritical Marks.
Special Character and Noncharacter Values.
Byte Order Mark (BOM).
Special Noncharacter Values.
Separators.
Layout and Format Control Characters.
The Replacement Character.
Controls and Control Sequences.
Control Characters.
Representing Control Sequences.
Conforming to the Unicode Standard.
Characters Not Used in a Subset.
Referencing Versions of the Unicode Standard.
Conformance Requirements.
Byte Ordering.
Invalid Code Values.
Interpretation.
Modification.
Transformations.
Bidirectional Text.
Unicode Technical Reports.
Semantics.
Characters and Coded Representations.
Simple Properties.
Combination.
Decomposition.
Compatibility Decomposition.
Canonical Decomposition.
Surrogates.
Transformations.
Special Character Properties.
Canonical Ordering Behavior.
Combining Classes.
Canonical Ordering.
Use with Collation.
Conjoining Jamo Behavior.
Syllable Boundaries.
Standard Syllables.
Hangul Syllable Composition.
Hangul Syllable Decomposition.
Hangul Syllable Names.
Bidirectional Behavior.
Directional Formatting Codes.
Basic Display Algorithm.
Definitions.
Resolving Embedding Levels.
Reordering Resolved Levels.
Bidirectional Conformance.
Implementation Notes.
Case — Normative.
Combining Classes — Normative.
Directionality — Normative.
Jamo Short Names — Normative.
General Category — Normative in Part.
Numeric Value — Normative.
Mirrored — Normative.
Unicode 1.0 Names.
Mathematical Property.
Letters and Other Useful Properties.
Transcoding to Other Standards.
Issues.
Multistage Tables.
7-Bit or 8-Bit Transmission.
Mapping Table Resources.
ANSI/ISO C wchar_t.
Unknown and Missing Characters.
Unassigned and Private Use Character Codes.
Interpretable but Unrenderable Characters.
Reassigned Characters.
Handling Surrogate Pairs.
Handling Numbers.
Handling Properties.
Normalization.
Compression.
Line Handling.
Regular Expressions.
Language Information in Plain Text.
Requirements for Language Tagging.
Working with Language Tags.
Language Tags and Han Unification.
Editing and Selection.
Consistent Text Elements.
Strategies for Handling Nonspacing Marks.
Keyboard Input.
Truncation.
Rendering Nonspacing Marks.
Positioning Methods.
Locating Text Element Boundaries.
Boundary Specification.
Example Specifications.
Grapheme Boundaries.
Word Boundaries.
Line Boundaries.
Sentence Boundaries.
Random Access.
Identifiers.
Syntactic Rule.
Sorting and Searching.
Culturally Expected Sorting.
Unicode Character Equivalence.
Similar Characters.
Levels of Comparison.
Ignorable Characters.
Multiple Mappings.
Collating Out-of-Scope Characters.
Unmapped Characters.
Parameterization.
Optimizations.
Searching.
Sublinear Searching.
Case Mappings.
General Punctuation.
Punctuation: U+0020-U+00BF.
General Punctuation: U+2000-U+206F.
CJK Symbols and Punctuation: U+3000-U+303F.
CJK Compatibility Forms: U+FE30-U+FE4F.
Small Form Variants: U+FE50-U+FE6F.
Latin.
Letters of Basic Latin: U+0041-U+007A.
Letters of the Latin-1 Supplement: U+00C0-U+00FF.
Latin Extended-A: U+0100-U+017F.
Latin Extended-B: U+0180-U+024F.
IPA Extensions: U+0250-U+02AF.
Latin Extended Additional: U+1E00-U+1EFF.
Latin Ligatures: U+FB00-U+FB06.
Greek.
Greek: U+0370-U+03FF.
Greek Extended: U+1F00-U+1FFF.
Cyrillic.
Cyrillic: U+0400-U+04FF.
Armenian.
Armenian: U+0530-U+058F.
Georgian.
Georgian: U+10A0-U+10FF.
Runic.
Runic: U+16A0-U+16F0.
Ogham.
Ogham: U+1680-U+169F.
Modifier Letters.
Spacing Modifier Letters: U+02B0-U+02FF.
Combining Marks.
Combining Diacritical Marks: U+0300-U+036F.
Combining Marks for Symbols: U+20D0-U+20FF.
Combining Half Marks: U+FE20-U+FE2F.
Hebrew.
Hebrew: U+0590-U+05FF.
Alphabetic Presentation Forms: U+FB1D-U+FB4F.
Arabic.
Arabic: U+0600-U+06FF.
Cursive Joining.
Ligatures.
Arabic Presentation Forms-A: U+FB50-U+FDFF.
Arabic Presentation Forms-B: U+FE70-U+FEFF.
Syriac.
Syriac: U+0700-U+074F.
Syriac Shaping.
Syriac Cursive Joining.
Ligatures.
Thaana.
Thaana: U+0780-U+07BF.
Devanagari.
Devanagari: U+0900-U+097F.
Bengali.
Bengali: U+0980-U+09FF.
Gurmukhi.
Gurmukhi: U+0A00-U+0A7F.
Gujarati.
Gujarati: U+0A80-U+0AFF.
Oriya.
Oriya: U+0B00-U+0B7F.
Tamil.
Tamil: U+0B80-U+0BFF.
Telugu.
Telugu: U+0C00-U+0C7F.
Kannada.
Kannada: U+0C80-U+0CFF.
Malayalam.
Malayalam: U+0D00-U+0D7F.
Sinhala.
Sinhala: U+0D80-U+0DFF.
Thai.
Thai: U+0E00-U+0E7F.
Lao.
Lao: U+0E80-U+0EFF.
Tibetan.
Tibetan: U+0F00-U+0FBF.
Myanmar.
Myanmar: U+1000-U+109F.
Khmer.
Khmer: U+1780-U+17FF.
Han.
CJK Unified Ideographs.
CJK Compatibility Ideographs: U+F900-U+FAFF.
Kanbun: U+3190-U+319F.
CJK and KangXi Radicals: U+2E80-U+2FD5.
Ideographic Description: U+2FF0-U+2FFB.
Hiragana.
Hiragana: U+3040-U+309F.
Katakana.
Katakana: U+30A0-U+30FF.
Halfwidth and Fullwidth Forms: U+FF00-U+FFEF.
Hangul.
Hangul Jamo: U+1100-U+11FF.
Hangul Compatibility Jamo: U+3130-U+318F.
Hangul Syllables: U+AC00-U+D7A3.
Bopomofo.
Bopomofo: U+3100-U+312F.
Yi.
Yi: U+A000-U+A4CF.
Ethiopic.
Ethiopic: U+1200-U+137F.
Cherokee.
Cherokee: U+13A0-U+13FF.
Canadian Aboriginal Syllabics.
Canadian Aboriginal Syllabics: U+1400-U+167F.
Mongolian.
Mongolian: U+1800-U+18AF.
Currency Symbols.
Currency Symbols: U+20A0-U+20CF.
Letterlike Symbols.
Letterlike Symbols: U+2100-U+214F.
Number Forms.
Number Forms: U+2150-U+218F.
Superscripts and Subscripts: U+2070-U+209F.
Mathematical Operators.
Mathematical Operators: U+2200-U+22FF.
Arrows: U+2190-U+21FF.
Technical Symbols.
Control Pictures: U+2400-U+243F.
Miscellaneous Technical: U+2300-U+23FF.
Optical Character Recognition: U+2440-U+245F.
Geometrical Symbols.
Box Drawing: U+2500-U+257F.
Block Elements: U+2580-U+259F.
Geometric Shapes: U+25A0-U+25FF.
Miscellaneous Symbols and Dingbats.
Miscellaneous Symbols: U+2600-U+26FF.
Dingbats: U+2700-U+27BF.
Enclosed and Square.
Enclosed Alphanumerics: U+2460-U+24FF.
Enclosed CJK Letters and Months: U+3200-U+32FF.
CJK Compatibility: U+3300-U+33FF.
Braille.
Braille: U+2800-U+28FF.
Control Codes.
C0 Control Codes: U+0000-U+001F.
C1 Control Codes: U+0080-U+009F.
Layout Controls.
Layout Controls.
Deprecated Format Characters.
Deprecated Format Characters: U+206A-U+206F.
Surrogates Area.
Surrogates Area: U+D800-U+DFFF.
Private Use Area.
Private Use Area: U+E000-U+F8FF.
Specials.
Specials: U+FEFF, U+FFF0-U+FFFF.
Character Names List.
Images in the Code Charts and Character Lists.
Cross References.
Case Form Mappings.
Decompositions.
Information about Languages.
Reserved Characters.
CJK Unified Ideographs.
Hangul Syllables.
Han Radical-Stroke Index.
Shift-JIS Index.
Proposal Guidelines.
Requirements of Proposal Form and Process.
Interim Solutions.
Sending Proposals.
History.
Unicode 1.0.
Unicode 2.0.
Unicode 3.0.
Encoding Forms in ISO/IEC 10646.
Zero Extending.
UCS Transformation Formats.
UTF-8.
UTF-16.
Synchronization of the Standards.
Identification of Features for the Unicode Standard.
Character Names.
Character Functional Specifications.
Versions of the Unicode Standard.
Changes from Unicode Version 2.0 to Version 2.1.
New Characters Added.
Character Semantics Changes.
Changes Affecting Conformance.
Changes from Unicode Version 2.1 to Version 3.0.
New Characters Added.
Character Semantics Changes.
Changes Affecting Conformance.
Unicode Technical Reports.
Source Standards.
Source Dictionaries for Han Unification.
Other Sources for the Unicode Standard.
Selected Resources.
Unicode Names Index.
General Index.
This book, The Unicode Standard, Version 3.0, is the authoritative source of information on the Unicode character encoding standard, the international character code for information processing that includes all major scripts of the world and is the foundation for development of software for worldwide use. As well as encoding characters used for written communication in a simple and consistent manner, the Unicode Standard defines character properties and algorithms for use in implementations.
Version 3.0 expands on material from Versions 2.0 and 2.1 and supersedes all other previous versions. The previous versions of the Unicode Standard are:
0.1 About the Unicode Standard
Concepts, Architecture, Conformance, and Guidelines
The Unicode Character Database and Technical Reports
The following Unicode Technical Reports are formally part of this standard:
On the CD-ROM
A range of Unicode values is expressed as U+xxxx→U+yyyy, or U+xxxx–U+yyyy, or xxxx..yyyy, where xxxx and yyyy are the first and last Unicode values in the range, and the arrow, long dash, or two dots indicate a contiguous range inclusive of the endpoints.
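As an illustration of the range notation, the following Python sketch reads a range written with any of the three separators and returns its inclusive endpoints. The function names `parse_range` and `range_size` are hypothetical, and accepting a plain hyphen (as plain-text listings often use) is an assumption rather than part of the convention.

```python
import re

# Separators: arrow, long dash (en or em), or two dots.
# The plain hyphen is accepted as an assumption, for plain-text listings.
_RANGE = re.compile(
    r"^(?:U\+)?([0-9A-Fa-f]{4,6})\s*(?:→|–|—|-|\.\.)\s*(?:U\+)?([0-9A-Fa-f]{4,6})$"
)

def parse_range(text):
    """Return (first, last) code points for a range such as 'U+0370–U+03FF'."""
    match = _RANGE.match(text.strip())
    if match is None:
        raise ValueError(f"not a Unicode range: {text!r}")
    first, last = (int(group, 16) for group in match.groups())
    if first > last:
        raise ValueError(f"range is reversed: {text!r}")
    return first, last

def range_size(text):
    """Count the code points in the range; both endpoints are included."""
    first, last = parse_range(text)
    return last - first + 1
```

Because the endpoints are inclusive, a range such as 0600..06FF covers 256 code points, not 255.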
In running text, a formal Unicode name is shown in small capitals (for example, GREEK SMALL LETTER MU), and alternative names (aliases) appear in italics (for example, umlaut). Italics are also used to refer to a text element that is not explicitly encoded (for example, pasekh alef) or to set off a foreign word (for example, the Welsh word ynghyd). Phonemic transcriptions are shown between slashes, as in Khmer /khnyom/.
The symbols used in the character names list are described at the beginning of Chapter 14, Code Charts.
In the text of this book, the word "Unicode" when used alone as a noun refers to the Unicode Standard.
In this book, unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.
Extended BNF
A sequence of characters is sometimes listed in text with angle brackets, such as <a, grave> or <U+0061, U+0300>.
Table 0-1. Extended BNF
Symbols | Meaning |
x := ... | production rule |
x y | the sequence consisting of x then y |
x* | zero or more occurrences of x |
x? | zero or one occurrence of x |
x+ | one or more occurrences of x |
x | y | either x or y |
( x ) | for grouping |
x || y | equivalent to (x | y | (x y)) |
{ x } | equivalent to (x)? |
"abc" | string literals ( "_" is sometimes used to denote space for clarity) |
'abc' | string literals (alternative form) |
\u1234 | Unicode characters within string literals or character classes |
\v00101234 | Unicode scalar values within string literals or character classes |
U+HHHH | Unicode character literal: equivalent to '\uHHHH' |
U-HHHHHHHH | Unicode character literal: equivalent to '\vHHHHHHHH' |
charClass | character class (syntax below) |
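The two character-literal forms in the table, and the angle-bracket sequence notation, can be read mechanically. The Python sketch below is one possible reading, not a parser defined by the standard; `parse_literal` and `parse_sequence` are hypothetical names.

```python
import re

# U+HHHH takes exactly four hex digits; U-HHHHHHHH takes exactly eight.
_LITERAL = re.compile(r"^(?:U\+([0-9A-Fa-f]{4})|U-([0-9A-Fa-f]{8}))$")

def parse_literal(text):
    """Convert 'U+HHHH' or 'U-HHHHHHHH' to the character it denotes."""
    match = _LITERAL.match(text.strip())
    if match is None:
        raise ValueError(f"not a Unicode character literal: {text!r}")
    digits = match.group(1) or match.group(2)
    return chr(int(digits, 16))

def parse_sequence(text):
    """Convert a bracketed sequence such as '<U+0061, U+0300>' to a string."""
    inner = text.strip()
    if not (inner.startswith("<") and inner.endswith(">")):
        raise ValueError(f"not a bracketed sequence: {text!r}")
    return "".join(parse_literal(item) for item in inner[1:-1].split(","))
```

For example, `parse_sequence("<U+0061, U+0300>")` yields the two-character string consisting of the letter a followed by the combining grave accent.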
Character Classes. A character class is constructed from one or two base sets. It is either a single base set, the negation of a base set, or the (set) difference between two base sets. The base sets themselves are bounded by brackets, and contain lists of characters, ranges of characters, general categories, or negations of general categories. The syntax follows:
charClass := baseSet | '¬' baseSet | baseSet '-' baseSet
General categories are defined in Chapter 4, Character Properties, such as {Uppercase Letter} for uppercase letter. Main categories such as {Mark} are the equivalent of a list of multiple subcategories: {Non-Spacing Mark}{Spacing Combining Mark}{Enclosing Mark}. Examples are found in Table 0-2, Character Class Examples.
Table 0-2. Character Class Examples
Syntax | Matches |
a-z | English lowercase letters |
a-z-c | English lowercase letters except for c |
¬c | all characters but c |
0-9 | European decimal digits |
\u0030-\u0039 | (same as above, using Unicode escapes) |
0-9, A-F, a-f | hexadecimal digits |
{Letter},{Non-Spacing Mark} | all letters and non-spacing marks |
{L},{Mn} | (same as above, using abbreviated notation) |
{¬Cn} | all assigned Unicode characters |
\u0600-\u06FF-{Cn} | all assigned Arabic characters |
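The set operations behind these examples can be modeled with small predicates. The Python sketch below builds each class by hand instead of parsing the notation, and every helper name in it is hypothetical. It takes general categories from Python's `unicodedata` module, which reflects a later version of the Unicode Character Database than Version 3.0, so the set of assigned characters will differ from the one described in this book.

```python
import unicodedata

def char_range(first, last):
    """Base set item: every character from first to last, inclusive."""
    lo, hi = ord(first), ord(last)
    return lambda ch: lo <= ord(ch) <= hi

def category(abbreviation):
    """Base set item: a general category.

    A one-letter main category such as 'L' matches all of its
    subcategories (Lu, Ll, ...); a two-letter one such as 'Mn' matches itself.
    """
    return lambda ch: unicodedata.category(ch).startswith(abbreviation)

def union(*items):
    """A base set listing several items, as in {L},{Mn}."""
    return lambda ch: any(item(ch) for item in items)

def negate(base_set):
    """The negation of a base set."""
    return lambda ch: not base_set(ch)

def difference(left, right):
    """The set difference between two base sets."""
    return lambda ch: left(ch) and not right(ch)

# a-z-c: English lowercase letters except for c
lowercase_except_c = difference(char_range("a", "z"), char_range("c", "c"))
# {L},{Mn}: all letters and non-spacing marks
letters_and_marks = union(category("L"), category("Mn"))
# {¬Cn}: all assigned characters
assigned = negate(category("Cn"))
# \u0600-\u06FF-{Cn}: all assigned Arabic characters
assigned_arabic = difference(char_range("\u0600", "\u06FF"), category("Cn"))
```

Each class is then an ordinary function: `lowercase_except_c("b")` is true and `lowercase_except_c("c")` is false.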
Operators
Operators used in this standard are listed in Table 0-3, Operators.
Table 0-3. Operators
~ | allow break here (see Section 5.15, Locating Text Element Boundaries) |
× | do not allow a break here |
→ | is transformed to, or behaves like |
/ | integer division (rounded down) |
% | modulo operation; equivalent to the integer remainder for positive numbers |
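To show the two arithmetic operators at work, the following Python sketch applies them to the arithmetic decomposition of a Hangul syllable into conjoining jamo. The constants are the usual ones for this computation but are supplied here as assumptions, since they do not appear in this section; `decompose_hangul` is a hypothetical name. Python's `//` rounds down and `%` agrees with the integer remainder for positive numbers, matching the table.

```python
# Assumed constants for the Hangul Syllables block (U+AC00-U+D7A3).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 syllables in total

def decompose_hangul(syllable):
    """Split one precomposed Hangul syllable into its jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a Hangul syllable; leave unchanged
    lead = L_BASE + s_index // N_COUNT                # integer division
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT   # modulo, then division
    trail = T_BASE + s_index % T_COUNT                # modulo
    result = chr(lead) + chr(vowel)
    if trail != T_BASE:  # an index of zero means no trailing consonant
        result += chr(trail)
    return result
```

The distinction between rounding down and truncating only matters for negative operands: in Python, `-7 // 2` is `-4` and `-7 % 2` is `1`, whereas a truncating division would give `-3` with remainder `-1`.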
0.3 Resources
The Unicode Consortium provides a number of online resources for obtaining information and data about the Unicode Standard, as well as updates and corrigenda. They are listed below.
Unicode Web Site