- What Is Well-Formedness?
- Change Name to Lowercase
- Quote Attribute Value
- Fill In Omitted Attribute Value
- Replace Empty Tag with Empty-Element Tag
- Add End-tag
- Remove Overlap
- Convert Text to UTF-8
- Escape Less-Than Sign
- Escape Ampersand
- Escape Quotation Marks in Attribute Values
- Introduce an XHTML DOCTYPE Declaration
- Terminate Each Entity Reference
- Replace Imaginary Entity References
- Introduce a Root Element
- Introduce the XHTML Namespace
Escape Ampersand
Convert & to &amp;.
<a href="/discipline/470.html">Health & Kinesiology</a> <img src="text.gif" alt="Texts & Technology" />
<a href="/discipline/470.html">Health &amp; Kinesiology</a> <img src="text.gif" alt="Texts &amp; Technology" />
Motivation
Although most browsers can handle a raw ampersand followed by whitespace, an ampersand not followed by whitespace confuses quite a few. An unescaped ampersand can hide content from the reader. Even if you aren't transitioning to full XHTML, this refactoring is an important fix.
Potential Trade-offs
None. This change can only improve your web pages.
However, you do need to be careful about embedded JavaScript within pages. In these cases, the ampersand usually cannot be escaped. Sometimes you instead can use an external script where the escaping is not necessary. Other times, you can hide the script inside comments where the parser will not worry about the ampersands.
Mechanics
Because this bug causes visible problems, it rarely survives in large numbers. You can typically find all the occurrences and fix them by hand.
I don't know of a single regular expression that will find all unescaped ampersands. However, a few simple expressions will usually sniff them all out. First, look for any ampersand followed by whitespace. This is never legal in HTML. This regular expression will find those:
&\s
If the pages don't contain embedded JavaScript, simply search for &(\s) and replace it with &amp;\1. A validator such as xmllint or HTML Validator will easily find all of these cases, along with a few the simple search will miss. However, if pages do contain JavaScript, you must be more careful and should let Tidy or TagSoup do the work.
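As a rough illustration of a broader search, the following Python sketch flags any ampersand that does not begin a complete entity or character reference. The pattern and the function names are assumptions made for this example, not a definitive solution. It knows nothing about script elements or comments, so it is only suitable for pages without embedded JavaScript, and it will also flag unterminated references such as &copy without the semicolon.

```python
import re

# Matches an ampersand that does NOT begin a complete entity or character
# reference such as &amp;, &#38;, or &#x26;. (Illustrative pattern only.)
UNESCAPED_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def find_unescaped_ampersands(text):
    """Return 1-based (line, column) pairs for each raw ampersand."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in UNESCAPED_AMP.finditer(line):
            hits.append((line_no, match.start() + 1))
    return hits


def escape_unescaped_ampersands(text):
    """Replace each raw ampersand with &amp;, leaving existing references alone."""
    return UNESCAPED_AMP.sub("&amp;", text)


sample = '<img src="text.gif" alt="Texts & Technology" />'
print(find_unescaped_ampersands(sample))
print(escape_unescaped_ampersands(sample))
```

Running the finder first and reviewing its report before applying the replacement keeps the fix close to the by-hand approach described above.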
Embedded JavaScript presents a special problem here. JavaScript does not recognize &amp; as an ampersand. JavaScript code must use the literal & character. I normally place the script in an external file or an XML comment instead:
<script type="text/javascript" language="javascript">
<!--
if (location.host.toLowerCase().indexOf("example.com") < 0 &&
    location.host.toLowerCase().indexOf("example.org") < 0) {
  location.href="http://www.example.org/";
}
// -->
</script>
If a site is dynamically generated from a database, this problem can become more frequent. A SQL database has no trouble storing a string such as "A&P" in a field, and indeed it is the unescaped string that should be stored.
When you receive data from a database or any other external source, clean it first by escaping these ampersands. For example, in a Java environment, the Apache Commons Lang library includes a StringEscapeUtils class that can encode raw data using either XML or HTML rules.
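The same idea can be sketched in Python with the standard library's html.escape function. The row dictionary and the render_discipline_link helper below are hypothetical names invented for this example. The point is that the database keeps the raw string and escaping happens only when the value is written into markup.

```python
from html import escape


def render_discipline_link(href, title):
    """Build an <a> element, escaping both values at output time."""
    # quote=True also escapes " and ', which matters inside attribute values.
    return '<a href="{}">{}</a>'.format(
        escape(href, quote=True), escape(title, quote=True)
    )


# Hypothetical row as it might come back from a SQL query, stored unescaped.
row = {"href": "/discipline/470.html", "title": "Health & Kinesiology"}
print(render_discipline_link(row["href"], row["title"]))
```

Escaping at the output boundary, rather than before storage, avoids double escaping when the same field is later reused in a context that is not HTML.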
Do not forget to escape ampersands that appear in URL query strings. In particular, a URL such as this:
http://example.com/search?name=detail&uid=165
must become this:
http://example.com/search?name=detail&amp;uid=165
This is true even inside href attributes of a elements:
<a href="http://example.com/search?name=detail&amp;uid=165">Search</a>
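When links like this are generated by code, the query string and the HTML escaping are two separate steps. The sketch below, which assumes Python's standard urllib.parse.urlencode and html.escape and uses a hypothetical search_link helper, first builds the query with a literal & separator and then escapes the whole URL for use in an attribute value.

```python
from html import escape
from urllib.parse import urlencode


def search_link(base_url, params, label):
    """Return an <a> element whose href is safe to embed in markup."""
    # Step 1: build the real URL; urlencode joins parameters with a raw "&".
    url = base_url + "?" + urlencode(params)
    # Step 2: escape the URL for the attribute value, turning "&" into "&amp;".
    return '<a href="{}">{}</a>'.format(escape(url, quote=True), escape(label))


print(search_link("http://example.com/search", {"name": "detail", "uid": 165}, "Search"))
```

The browser decodes &amp; back to & before it requests the URL, so the server still receives the unescaped query string.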