- What Is Well-Formedness?
- Change Name to Lowercase
- Quote Attribute Value
- Fill In Omitted Attribute Value
- Replace Empty Tag with Empty-Element Tag
- Add End-tag
- Remove Overlap
- Convert Text to UTF-8
- Escape Less-Than Sign
- Escape Ampersand
- Escape Quotation Marks in Attribute Values
- Introduce an XHTML DOCTYPE Declaration
- Terminate Each Entity Reference
- Replace Imaginary Entity References
- Introduce a Root Element
- Introduce the XHTML Namespace
Change Name to Lowercase
Make all element and attribute names lowercase. Make most entity names lowercase, except for those that refer to capital letters.
<BLOCKQUOTE CITE='http://www.gutenberg.org/dirs/etext00/dvlft10.txt'> <P> It was, then, with <EM>considerable</EM> surprise that I received a telegram from Holmes last Tuesday&MDASH;he has never been known to write where a telegram would serve&MDASH;in the following terms: </P> <P> Why not tell them of the Cornish horror&MDASH;strangest case I have handled. </P> </BLOCKQUOTE>
<blockquote cite='http://www.gutenberg.org/dirs/etext00/dvlft10.txt'> <p> It was, then, with <em>considerable</em> surprise that I received a telegram from Holmes last Tuesday&mdash;he has never been known to write where a telegram would serve&mdash;in the following terms: </p> <p> Why not tell them of the Cornish horror&mdash;strangest case I have handled. </p> </blockquote>
Motivation
XHTML uses lowercase names exclusively. All elements and attributes are written in lowercase. For example, <table> is recognized but not <TABLE> or <Table>. In XHTML mode, lowercase is required.
Generic XML tools don't care about case but do care that it matches. That is, a <table> start-tag is closed by a </table> end-tag but not by </TABLE> or </Table>. Case matters for attributes, too. The id attribute has the type ID as defined in the XHTML DTD and can be used as a link anchor. However, the ID attribute does not and cannot.
Potential Trade-offs
There are relatively few trade-offs for converting to lowercase. All modern browsers support lowercase tag names without any problems. A few very old browsers that were never in widespread use, such as HotJava, only supported uppercase for some tags. The same is true of early versions of Java Swing's built-in HTML renderer. However, this has long since been fixed.
It is also possible that some homegrown scripts based on regular expressions may not recognize lowercase forms. If you have any scripts that screen-scrape your HTML, you'll need to check them to make sure they're also ready to handle lowercase tag names. Once you're done making the document well-formed, it may be time to consider refactoring those scripts, too, so that they use a real parser instead of regular expression hacks. However, that can wait. Usually it's simple enough to change the expressions to look for lowercase tag names instead of uppercase ones, or to not care about the case of the tag names at all.
Mechanics
The first rule of well-formedness is that every start-tag has a matching end-tag. The matching part is crucial. Although classic HTML is case-insensitive, XML and XHTML are not. <DIV> is not the same as <div> and a </div> end-tag cannot close a <DIV> start-tag.
For purely well-formedness reasons, all that's needed is to normalize the case. All tags could be capitalized or not, as long as you're consistent. However, it's easiest for everyone if we pick one case convention and stick to it. The community has chosen lowercase for XHTML. Thus, the first step is to convert all tag names, attribute names, and entity names to lowercase. For example:
- <P> to <p>
- <Table> to <table>
- </DIV> to </div>
- <BLOCKQUOTE CITE="http://richarddawkins.net/article,372,n,n"> to <blockquote cite="http://richarddawkins.net/article,372,n,n">
- &COPY; to &copy;
There are several ways to do this.
The first and the simplest is to use TagSoup or Tidy in XHTML mode. Along with many other changes, these tools will convert all tag and attribute names to lowercase. They will also change entity names that need to be in lowercase.
You also can accomplish this with regular expressions. Because HTML element and attribute names are composed almost exclusively of the Latin letters A to Z and a to z (the exceptions are the digits in h1 through h6 and the hyphen in attribute names such as http-equiv), this isn't too difficult. Let's start with the element names. There are likely to be thousands, perhaps millions, of these, so you don't want to fix them by hand.
Tags are easy to search for. This regular expression will find all start-tags that contain at least one capital letter:
<[a-zA-Z]*[A-Z]+[a-zA-Z]*
This regular expression will find all end-tags that contain at least one capital letter:
</[a-zA-Z]*[A-Z]+[a-zA-Z]*>
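To see what these two patterns actually catch, here is a minimal sketch in Python (an assumption; the text itself does not prescribe a language for searching). The function name find_uppercase_tags and the sample string are hypothetical. Note that the start-tag pattern deliberately stops before the closing >, because attributes may follow the name.

```python
import re

# The two patterns from the text, copied as written.
START_TAG = re.compile(r"<[a-zA-Z]*[A-Z]+[a-zA-Z]*")
END_TAG = re.compile(r"</[a-zA-Z]*[A-Z]+[a-zA-Z]*>")

def find_uppercase_tags(html):
    """Return (start_tags, end_tags) whose names contain a capital letter."""
    return START_TAG.findall(html), END_TAG.findall(html)

# Hypothetical sample mixing uppercase, mixed-case, and lowercase tags.
sample = "<DIV><p>Hello</p><Table></Table></DIV>"
starts, ends = find_uppercase_tags(sample)
print(starts)  # ['<DIV', '<Table']
print(ends)    # ['</Table>', '</DIV>']
```

A plain regular expression search like this knows nothing about comments, scripts, or CDATA sections, so it is worth reviewing the hits before replacing anything. Because the end-tag pattern allows only letters before the >, it will not report end-tags whose names contain digits, such as </H1>.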
Entities are also easy. This regular expression finds all entity references that contain a capital letter other than the initial letters:
&[A-Za-z][A-Za-z][A-Z]+[A-Za-z]*;
I set up the preceding regular expression to ignore the case of the first two letters and to require a capital letter in the third position. This avoids accidentally triggering on references such as &Omega; that should have a single initial capital letter and on references such as &AElig; that have two initial capital letters. This may miss some cases, such as &Amp; and &AMp;, but those are rare in practice. Usually entity references are either all uppercase or all lowercase. If any such mixed cases exist, we'll find them later with xmllint and fix them by hand.
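The behavior of the entity pattern is easier to trust once you have run it against a few known cases. The following sketch, again assuming Python and using a hypothetical function name and sample string, shows which references are reported and which are deliberately left alone.

```python
import re

# The entity pattern from the text: any two letters, then at least
# one capital letter, then any remaining letters and the semicolon.
ENTITY = re.compile(r"&[A-Za-z][A-Za-z][A-Z]+[A-Za-z]*;")

def find_miscased_entities(text):
    """Return entity references with a capital letter in the third position."""
    return ENTITY.findall(text)

# Hypothetical sample: two miscased references, three correct ones,
# and the two rare mixed-case forms the pattern is known to miss.
sample = "&MDASH; &mdash; &Omega; &AElig; &COPY; &Amp; &AMp;"
found = find_miscased_entities(sample)
print(found)  # ['&MDASH;', '&COPY;']
```

As expected, &Omega; and &AElig; are not reported, and neither are &Amp; and &AMp;, which is why a later validation pass is still needed.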
Attributes are trickier to find because the pattern to find them (=name) may appear inside the plain text of the document. I much prefer to use Tidy or TagSoup to fix these. However, if you know you have a large problem with particular attributes, it's easy to do a search and replace for individual ones—for instance, HREF= to href=. As long as you aren't writing about HTML, that string is unlikely to appear in plain text content.
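If you do go the search-and-replace route for one attribute at a time, a small helper can cover every case variant (HREF=, Href=, and so on) in one pass. This is only a sketch under stated assumptions: it uses Python, the function name lowercase_attribute is hypothetical, and it requires whitespace before the name and an equals sign after it. Like the manual search, it cannot tell markup from plain text that happens to contain the same string.

```python
import re

def lowercase_attribute(html, name):
    """Lowercase one attribute name wherever it appears as ' NAME='.

    The match requires whitespace before the name and an equals sign
    (optionally preceded by whitespace) after it.
    """
    pattern = re.compile(
        r"(?<=\s)" + re.escape(name) + r"(?=\s*=)", re.IGNORECASE
    )
    return pattern.sub(name.lower(), html)

# Hypothetical sample with two case variants of the same attribute.
sample = '<A HREF="http://example.com/">home</A> <a Href = "x.html">x</a>'
fixed = lowercase_attribute(sample, "href")
print(fixed)
# <A href="http://example.com/">home</A> <a href = "x.html">x</a>
```

The element names are left untouched here; only the named attribute changes, so the helper can be run once per problem attribute.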
Sometimes your initial find will discover that only a few tags use uppercase. For instance, if there are lots of uppercase table tags, you can quickly change <TD> to <td>, </TD> to </td>, <TR> to <tr>, and so forth without even using regular expressions. If the problem is a little broader, consider using Tidy or TagSoup. If that doesn't work, you'll need a tool that can replace text while changing its case. jEdit can't do this. However, Perl and BBEdit can. Use \L in the replacement pattern to convert all characters to lowercase. For example, let's start with the regular expression for start-tags:
(<[a-zA-Z]*[A-Z]+[a-zA-Z]*)
This expression will replace it with its lowercase equivalent:
\L\1
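If neither Perl nor BBEdit is at hand, the same replacement can be approximated in other languages. The sketch below assumes Python, whose re module has no \L escape, so a replacement function does the lowercasing instead. The names TAG_NAME, ENTITY, and lowercase_names are hypothetical, and the tag pattern is a slight variation on the ones in the text: it adds an optional slash so that one pattern covers both start-tags and end-tags.

```python
import re

# Start-tag pattern from the text, with an optional "/" added so the
# same pattern also covers end-tags (an adaptation, not the original).
TAG_NAME = re.compile(r"</?[a-zA-Z]*[A-Z]+[a-zA-Z]*")
# Entity pattern from the text, unchanged.
ENTITY = re.compile(r"&[A-Za-z][A-Za-z][A-Z]+[A-Za-z]*;")

def lowercase_names(html):
    """Lowercase tag names and miscased entity references."""
    def to_lower(match):
        # Equivalent in spirit to \L\1: emit the matched text in lowercase.
        return match.group(0).lower()
    html = TAG_NAME.sub(to_lower, html)
    return ENTITY.sub(to_lower, html)

# Hypothetical sample with uppercase tags and an uppercase entity name.
sample = "<DIV><P>Holmes&MDASH;<EM>really</EM></P></DIV>"
result = lowercase_names(sample)
print(result)  # <div><p>Holmes&mdash;<em>really</em></p></div>
```

Attribute names are not touched by this sketch, and references such as &Omega; pass through unchanged because the entity pattern never matches them.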