- What Is Well-Formedness?
- Change Name to Lowercase
- Quote Attribute Value
- Fill In Omitted Attribute Value
- Replace Empty Tag with Empty-Element Tag
- Add End-tag
- Remove Overlap
- Convert Text to UTF-8
- Escape Less-Than Sign
- Escape Ampersand
- Escape Quotation Marks in Attribute Values
- Introduce an XHTML DOCTYPE Declaration
- Terminate Each Entity Reference
- Replace Imaginary Entity References
- Introduce a Root Element
- Introduce the XHTML Namespace
Escape Less-Than Sign
Convert < to &lt;.
x < y ==> y > x
x &lt; y ==> y > x
Motivation
Although some browsers can recover from an unescaped less-than sign some of the time, not all can. An unescaped less-than sign is more likely than not to cause content to be hidden in the browser. Even if you aren't transitioning to full XHTML, this one is a critical fix.
Potential Trade-offs
None. This change can only improve your web pages. However, you do need to be careful about embedded JavaScript within pages. In these cases, sometimes the less-than sign cannot be escaped. You can either move the script to an external document where the escaping is not necessary or reverse the sense of the comparison.
Mechanics
Because this is a real bug that does cause problems on pages, it's unlikely to show up in a lot of places. You can usually find all the occurrences and fix them by hand.
I don't know of a single regular expression that will find every case. However, a few will serve to find most. The first thing to look for is any less-than sign followed by whitespace. This is never legal in HTML. This regular expression will find those:
<\s
If you're not using any embedded JavaScript, you can search for <(\s) and replace it with &lt;\1. However, if you're using JavaScript, you need to be more careful and should probably let Tidy or TagSoup do the work.
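As a rough sketch of that search and replace, the snippet below applies it to a string in JavaScript. The function name escapeLooseLessThan is hypothetical, and the sketch assumes the input contains no embedded JavaScript. Note that JavaScript writes the back-reference as $1 rather than \1.

```javascript
// Hypothetical helper: escape every "<" that is immediately followed by
// whitespace. Assumes the markup contains no embedded JavaScript.
function escapeLooseLessThan(html) {
  // (\s) captures the whitespace character so it can be put back via $1.
  return html.replace(/<(\s)/g, "&lt;$1");
}

const before = "<p>x < y ==> y > x</p>";
console.log(escapeLooseLessThan(before));
// <p>x &lt; y ==> y > x</p>
```

The tags themselves are untouched because their less-than signs are followed by a name character or a slash, not by whitespace.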
If your pages involve mathematics at all, it's also worth doing a search for a < followed by a digit:
<\d
However, a validator such as xmllint or HTML Validator should easily find all of these cases, along with a few that the simple searches will miss.
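If you want a quick first pass before running a validator, the two searches can be combined into one scan that reports line numbers. This is only a sketch: the function name findSuspectLessThans is hypothetical, it is no substitute for a validator, and it will also flag legitimate comparisons inside embedded JavaScript.

```javascript
// Hypothetical helper: report lines containing a "<" that is followed by
// whitespace, a digit, or the end of the line.
function findSuspectLessThans(text) {
  const suspect = /<(?:[\s\d]|$)/;
  const hits = [];
  text.split("\n").forEach((line, index) => {
    if (suspect.test(line)) {
      // Line numbers are 1-based, as an editor would show them.
      hits.push({ lineNumber: index + 1, line });
    }
  });
  return hits;
}

const sample = [
  "<p>if x < 7 then stop</p>",
  "<p>0<1 is true</p>",
  "<p>nothing to fix here</p>",
].join("\n");

for (const hit of findSuspectLessThans(sample)) {
  console.log(`${hit.lineNumber}: ${hit.line}`);
}
// 1: <p>if x < 7 then stop</p>
// 2: <p>0<1 is true</p>
```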
Embedded JavaScript presents a problem here. JavaScript does not recognize &lt; as a less-than sign. Inside JavaScript, you have to use the literal character. A less-than sign can usually be recast as a greater-than sign with arguments reversed. For example, instead of writing
if (x < 7)
you write
if (7 > x)
However, I normally just rely on placing the script in an external file or an XML comment instead:
<script type="text/javascript" language="javascript">
<!--
if (location.host.toLowerCase().indexOf("example.com") < 0
    && location.host.toLowerCase().indexOf("example.org") <= 0) {
  location.href="http://www.example.org/";
}
// -->
</script>
This is a truly ugly hack and one I cringe to even suggest, but it is what seems to work and what browsers expect and deal with, and it is well-formed.
A lot of these problems can spread out across a site when the site is dynamically generated from a database and the scripts or templates that generate it do not sufficiently clean the data they're working with. A typical SQL database has no trouble storing a string such as x > y in a VARCHAR field. However, when you take data out of a database and put it into a page, you have to clean it first by escaping any such characters. Most major templating languages have functions for doing exactly this. For instance, in PHP the htmlspecialchars function converts the five reserved characters (>, <, &, ', and ") into the equivalent entity references. The single quote is converted only when the ENT_QUOTES flag is in effect, which is the default as of PHP 8.1. Just make sure you use it. Even if you think there's no possible way the data can contain reserved characters such as <, I still recommend cleaning it. It doesn't take long, and it can plug some nasty security holes that arise from people deliberately injecting weird data into your system.
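If your templating layer is written in JavaScript and offers no such function, a minimal escaping helper might look like the following sketch. The names escapeMarkup and ENTITY_FOR are hypothetical, and the helper is only loosely modeled on what htmlspecialchars does. It emits the numeric reference &#39; for the single quote.

```javascript
// Hypothetical lookup table: each reserved character and its replacement.
const ENTITY_FOR = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: escape the five reserved characters in one pass,
// so an ampersand produced by an earlier replacement is never escaped twice.
function escapeMarkup(value) {
  return String(value).replace(/[&<>"']/g, (ch) => ENTITY_FOR[ch]);
}

// Pretend this string came out of a VARCHAR column.
const fromDatabase = 'x < y && name = "O\'Reilly"';
console.log(`<p>${escapeMarkup(fromDatabase)}</p>`);
// <p>x &lt; y &amp;&amp; name = &quot;O&#39;Reilly&quot;</p>
```

Escaping at the point where data is written into the page, rather than when it is stored, keeps the database contents usable for output formats other than HTML.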