Character Sets and Encodings
In the beginning, there was ASCII. And it was good. ASCII is a 7-bit system for defining characters a computer should display. After all, when you get right down to it, a computer only understands numbers -- and binary numbers, at that.
So when I want the computer to display, say, an f, I can tell it to use ASCII character number 102, and the computer knows what I mean. It chooses the f character from the ASCII character set. That set, with the lowercase letters and special characters omitted for simplicity, looks something like this:
A | B | C | D | E | F | G | H | I | J | K | L | M |
65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 |
N | O | P | Q | R | S | T | U | V | W | X | Y | Z |
78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 |
So when I want to display
FOX OLYMPICS
The computer actually sees
70 79 88 32 79 76 89 77 80 73 67 83
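You can check this mapping for yourself. Here is a minimal sketch in Python (my choice of language for illustration; the variable names are made up for this example), using the built-in `ord` and `chr` functions:

```python
# Convert text to the numbers the computer stores, and back again.
text = "FOX OLYMPICS"

# ord() returns the numeric code of a single character.
codes = [ord(ch) for ch in text]
print(codes)

# chr() goes the other way: from a number back to its character.
round_trip = "".join(chr(n) for n in codes)
print(round_trip)

# The lowercase f mentioned earlier is character number 102.
print(ord("f"))
```

The space between the two words gets a number too (32), which is why the list has twelve entries rather than eleven.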
But I could go ahead and create a different character set of, say
Ж | З | ٸ | Θ | إ | س | г | ئ | ش | В | Т | І | Ї |
65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 |
И | в | פ | ל | Ι | ٸ | ף | х | א | ۓ | ل | م | ن |
78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 |
so the same display instructions now tell the computer to display
سвل вІمЇפشٸٸ
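Real character sets behave the same way as my made-up one. As a rough sketch, assuming Python's built-in codecs for three standard 8-bit character sets (ISO-8859-1, ISO-8859-5, and ISO-8859-7), here are the same three numbers displayed three different ways:

```python
# The same three numbers, interpreted by three different character sets.
raw = bytes([225, 226, 227])

as_latin = raw.decode("iso-8859-1")     # Western European
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic
as_greek = raw.decode("iso-8859-7")     # Greek

for label, shown in [("Latin", as_latin),
                     ("Cyrillic", as_cyrillic),
                     ("Greek", as_greek)]:
    print(label, shown)
```

Nothing about the numbers changes between the three lines of output; only the table used to look them up does.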
Now, aside from amusing ourselves by making up alphabets, why do we care about this?
We care about it because the ASCII character set only has space for 128 characters. That means that anybody speaking a language with non-ASCII characters, such as ä or é, is out of luck in the ASCII way of doing things.
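To see the limit in action, here is a small sketch in Python. The helper name `fits_in_ascii` is hypothetical, invented for this example; it relies only on the standard `str.encode` method, which raises `UnicodeEncodeError` when a character has no slot in the target character set:

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text has an ASCII code (0-127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character has no slot among ASCII's 128.
        return False


print(fits_in_ascii("FOX OLYMPICS"))
print(fits_in_ascii("café"))
print(fits_in_ascii("Käse"))
```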
That makes things tough for a supposedly universal format such as XML.
So what's a poor data architect to do? Give up on reaching the billions of people whose native languages use non-ASCII characters? Of course not. Not in today's global economy. You could simply specify a different character set for a document in another language, but even if that were practical for exchanging data files -- which it's not -- it wouldn't work because some languages, such as Chinese, have not hundreds, but tens of thousands of characters.
Enter Unicode. Unicode was designed as a 16-bit character set, with room for 65,536 characters instead of ASCII's 7-bit, 128-character set. About 50,000 of those characters have been defined, including the typical unaccented Latin alphabet, accented characters, tens of thousands of Asian ideographs, and the alphabets of other languages such as Hebrew, Russian, and Arabic. (Unicode has since grown beyond 16 bits, to 1,114,112 possible code points, but the original 65,536 slots still cover nearly all text in everyday use.)
So everybody's happy and that takes care of the problem, right?
Well, not quite. Chances are, you're not going to need 65,000+ characters in your document. In many cases, you're going to be using a much smaller "slice" of the character set pie. In fact, you can usually fit all of your characters into an 8-bit, 256-character set, giving you two advantages. First, most text editors will be able to read your document, and second, it'll only be half the size of the 16-bit version. So how do these "slices" work?
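The size claim is easy to check. In this sketch I'm assuming that Python's `iso-8859-1` codec can stand in for "an 8-bit set" and `utf-16-le` for "the 16-bit version" (the `-le` variant just leaves off the byte-order mark so the count isn't skewed):

```python
text = "Chocolate bar"

eight_bit = text.encode("iso-8859-1")   # one byte per character
sixteen_bit = text.encode("utf-16-le")  # two bytes per character

print(len(text), "characters")
print(len(eight_bit), "bytes in the 8-bit version")
print(len(sixteen_bit), "bytes in the 16-bit version")
```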
Say, for example, I were writing a document in Russian. I'd need the Cyrillic alphabet for that. In Unicode, the Cyrillic alphabet takes up the slots between 1024 and 1273. But notice, that's only 250 slots, so if I started with a lower number, such as 1 instead of 1024, I could fit it all into an 8-bit character subset. So the Ж character might be number 182 instead of number 1046.
These slices, or subsets, are called character encodings. For example, one common encoding for the Cyrillic alphabet is ISO-8859-5. In English-speaking countries, we typically use the ISO-8859-1 encoding.
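The Ж example can be verified directly. This is just a sketch using Python's built-in `iso-8859-5` codec; the variable names are mine:

```python
ch = "Ж"

# The character's slot in the full Unicode character set.
unicode_number = ord(ch)

# The same character's slot in the 8-bit Cyrillic slice.
byte_in_subset = ch.encode("iso-8859-5")[0]

print(unicode_number, byte_in_subset)

# A whole Russian word takes one byte per character in this encoding.
word = "Привет"
print(len(word), len(word.encode("iso-8859-5")))
```

The first line of output shows the two numbers for the same letter: its position in Unicode and its position in the slice.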
All of this matters in XML because we can use virtually any encoding we want, so long as we specify it in the document's XML declaration. For example:
<?xml version="1.0" encoding="ISO-8859-1"?>
<candy>
  <item>
    <name>Chocolate bar</name>
    <calories>250</calories>
  </item>
</candy>
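To see a parser honor the declared encoding, here is a sketch using Python's standard `xml.etree.ElementTree` module. The document is a hypothetical variation on the candy example: I've swapped in a name with accented characters so that the encoding actually makes a difference.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the accented name is there to exercise the encoding.
xml_text = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<candy>\n"
    "  <item>\n"
    "    <name>Crème brûlée</name>\n"
    "    <calories>250</calories>\n"
    "  </item>\n"
    "</candy>\n"
)

# Store the bytes in the same encoding the declaration names.
xml_bytes = xml_text.encode("iso-8859-1")

# Given raw bytes, the parser reads the declaration and decodes accordingly.
root = ET.fromstring(xml_bytes)
name = root.find("item/name").text
calories = int(root.find("item/calories").text)

print(name, calories)
```

If the bytes on disk don't match the encoding named in the declaration, the parser will either reject the document or quietly produce the wrong characters, so the two need to agree.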
XML processors are required to understand the UTF-8 and UTF-16 encodings. Both can represent the complete set of Unicode characters. UTF-16 stores most characters as a single 16-bit unit and the rest as a pair of units. UTF-8 is built on 8-bit units: ASCII characters take the traditional one byte, and all other characters take two to four bytes. UTF-8 is the default encoding for XML, so if you leave out the optional encoding declaration (and the document doesn't begin with a UTF-16 byte-order mark), UTF-8 is assumed.
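A quick way to get a feel for the two required encodings is to count bytes. The sketch below uses Python's built-in codecs and the standard `xml.etree.ElementTree` parser; the sample characters are my own picks, with the last one (a musical clef) chosen because it sits outside the first 65,536 Unicode slots:

```python
import xml.etree.ElementTree as ET

samples = ["f", "é", "Ж", "中", "𝄞"]

# Bytes needed per character in each encoding ("-le" omits the byte-order mark).
utf8_sizes = {ch: len(ch.encode("utf-8")) for ch in samples}
utf16_sizes = {ch: len(ch.encode("utf-16-le")) for ch in samples}

for ch in samples:
    print(ch, utf8_sizes[ch], utf16_sizes[ch])

# With no encoding declaration, the parser treats the bytes as UTF-8.
no_declaration = "<name>café</name>".encode("utf-8")
parsed = ET.fromstring(no_declaration).text
print(parsed)
```

Plain ASCII text is smallest in UTF-8, which is one reason it makes a sensible default.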