- What Is Well-Formedness?
- Change Name to Lowercase
- Quote Attribute Value
- Fill In Omitted Attribute Value
- Replace Empty Tag with Empty-Element Tag
- Add End-tag
- Remove Overlap
- Convert Text to UTF-8
- Escape Less-Than Sign
- Escape Ampersand
- Escape Quotation Marks in Attribute Values
- Introduce an XHTML DOCTYPE Declaration
- Terminate Each Entity Reference
- Replace Imaginary Entity References
- Introduce a Root Element
- Introduce the XHTML Namespace
Convert Text to UTF-8
Reencode all text as Unicode UTF-8.
Motivation
Pages that contain anything beyond basic ASCII can display incorrectly across platforms. Windows encodings are not interpreted correctly on the Mac and vice versa. Web browsers try to guess a page's encoding, but they often guess wrong.
UTF-8 is a standard encoding that works across all web browsers and is supported by all major text editors and other tools. It is reasonably fast, small, and efficient. It can support all Unicode characters and is a good basis for internationalization and localization of pages.
Potential Trade-offs
You need to be able to control your web server's HTTP response headers to properly implement this. This can be problematic in shared hosting environments. Bad tools do not always recognize UTF-8 when they should.
Mechanics
There are two steps here. First, reencode all content in UTF-8. Second, tell clients that you've done that. Reencoding is straightforward, provided that you know what encoding you're starting with. You have to tell Tidy that you want UTF-8, but once you do, it will do the work:
$ tidy -asxhtml -m --output-encoding utf8 index.html
You don't have to tell TagSoup anything. It produces UTF-8 by default.
A number of command-line tools and other programs will also save content in UTF-8 if you ask, such as GNU recode (www.gnu.org/software/recode/recode.html), BBEdit, and jEdit. You should also set your editor of choice to save in UTF-8 by default.
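If you would rather script the reencoding yourself, the idea can be sketched in a few lines of Python. This is only a minimal sketch, assuming you already know the source encoding; the function names reencode_to_utf8 and reencode_file are hypothetical and not part of any tool mentioned here.

```python
from pathlib import Path


def reencode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    # Strict decoding raises UnicodeDecodeError when the bytes are not valid
    # in the stated encoding. It cannot catch every wrong guess, because many
    # single-byte encodings accept almost any byte sequence.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


def reencode_file(path: str, source_encoding: str) -> None:
    """Rewrite one file in place as UTF-8 (hypothetical helper)."""
    p = Path(path)
    p.write_bytes(reencode_to_utf8(p.read_bytes(), source_encoding))
```

For example, reencode_file("index.html", "windows-1252") would rewrite a page saved on Windows. Keep a backup, because a wrong guess about the source encoding silently corrupts non-ASCII characters.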
The next step is to tell the browsers that the content is in UTF-8. There are three parts to this.
- Add a byte order mark.
- Add a meta tag.
- Specify the Content-type header.
The byte order mark is Unicode character U+FEFF, the zero-width no-break space. When this is the first character in a document (the three bytes EF BB BF in UTF-8), the browser should recognize the byte sequence and treat the rest of the content as UTF-8. This shouldn't be necessary, but Internet Explorer and some other tools are more reliable if they have it. Some editors add this automatically and some require you to request it.
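If your editor won't add the byte order mark for you, a script can. The following is a minimal sketch, assuming the content has already been reencoded as UTF-8; add_utf8_bom is a hypothetical name, while codecs.BOM_UTF8 is the standard library constant for the three-byte mark.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Return data with a UTF-8 byte order mark prepended if it lacks one."""
    # Assumes data is already UTF-8. Prepending the mark to content in any
    # other encoding would mislabel it.
    if data.startswith(codecs.BOM_UTF8):
        return data  # Already marked; don't add a second one.
    return codecs.BOM_UTF8 + data
```

The check at the top makes the function safe to run repeatedly over the same files.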
The second part is to add a meta tag in the head, such as this one:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The charset=UTF-8 part warns browsers that they're dealing with UTF-8 if they haven't figured it out already.
Finally, you want to configure the web server so that it too specifies that the content is UTF-8. This can be tricky. It requires access to your server's configuration files or the ability to override the configuration locally. This may not be possible on a shared host, but it should be possible on a professionally managed server. On Apache, you can do this by adding the following line to your httpd.conf file or your .htaccess file within the content directory:
AddDefaultCharset utf-8
You really shouldn't have to do all three of these. One should be enough. However, in practice, some tools recognize one of these hints but not the others, and the redundancy doesn't hurt as long as you're consistent.
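Because consistency is what matters, it can be worth checking that the three hints agree. Here is a rough sketch of such a check, not a full HTML or HTTP parser: it looks for a UTF-8 byte order mark, a charset in a meta tag near the top of the page, and a charset in a Content-Type header value you supply. The names declared_encodings and hints_are_consistent are hypothetical, and the regular expressions are deliberately simple assumptions about how the hints are written.

```python
import codecs
import re
from typing import Dict, Optional

# Simplified patterns (assumptions): they match the common forms shown in the
# text, not every legal way of writing a meta tag or header.
META_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)
HEADER_RE = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE)


def _normalize(name: str) -> str:
    """Map aliases such as 'UTF8' and 'utf-8' to one canonical name."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()  # Unknown label; compare it as written.


def declared_encodings(body: bytes, content_type: Optional[str]) -> Dict[str, str]:
    """Collect the encoding named by each hint that is present."""
    found: Dict[str, str] = {}
    if body.startswith(codecs.BOM_UTF8):
        found["bom"] = "utf-8"
    meta = META_RE.search(body[:1024])  # Only inspect the start of the page.
    if meta:
        found["meta"] = _normalize(meta.group(1).decode("ascii"))
    if content_type:
        header = HEADER_RE.search(content_type)
        if header:
            found["header"] = _normalize(header.group(1))
    return found


def hints_are_consistent(body: bytes, content_type: Optional[str]) -> bool:
    """True when every hint that is present names the same encoding."""
    return len(set(declared_encodings(body, content_type).values())) <= 1
```

A page with a byte order mark and a UTF-8 meta tag that is served with charset=ISO-8859-1 would fail this check, which is exactly the kind of mismatch that leaves browsers guessing.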
I do not recommend adding an XML declaration. XML parsers don't need it for UTF-8 documents, because they assume UTF-8 when no encoding is declared, and it will confuse some browsers.