Replace Empty Tag with Empty-Element Tag
Change elements such as <br> to <br class='empty' />.
Polonius<hr>
You shall do marvelous wisely, good Reynaldo,<br>
Before you visit him, to make inquire<br>
Of his behavior.<br>

Polonius<hr class='empty' />
You shall do marvelous wisely, good Reynaldo,<br class='empty' />
Before you visit him, to make inquire<br class='empty' />
Of his behavior.<br class='empty' />
Motivation
XML parsers require that every start-tag have a matching end-tag. There can be no <p> without a corresponding </p>. Similarly, there can be no <br> without a corresponding </br>. Alternatively, you can use empty-element tag syntax, such as <br/> and <hr/>. This is usually simpler for elements that are guaranteed to be empty and more compatible with legacy browsers.
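One way to see this rule in action is to hand each form to an XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for this illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Classic HTML empty tag: the parser reaches </p> while <br> is still open.
print(is_well_formed("<p>Of his behavior.<br></p>"))  # False

# A start-tag/end-tag pair and an empty-element tag are both accepted.
print(is_well_formed("<p>Of his behavior.<br></br></p>"))  # True
print(is_well_formed("<p>Of his behavior.<br class='empty' /></p>"))  # True
```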
Potential Trade-offs
Although most modern browsers have no problem with empty-element tags, a few older ones that are still installed here and there, such as Netscape 3, do. For example, some will treat <br/> as an element whose name is br/ and will not insert the necessary break. Others will treat <br></br> as a double break rather than a single one. The content will still be present, but it may not be styled properly.
Mechanics
Classic HTML defines 12 empty elements:
- <br>
- <hr>
- <meta>
- <link>
- <base>
- <img>
- <embed>
- <param>
- <area>
- <frame>
- <col>
- <input>
In addition, a few other elements from various proprietary browser extensions may also appear:
- <basefont>
- <bgsound>
- <keygen>
- <sound>
- <spacer>
- <wbr>
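Before changing anything, it can help to know which of these elements a page actually uses. The sketch below is one possible inventory script, assuming Python's standard html.parser module; the class name EmptyTagCounter and the function count_unclosed_empty_tags are hypothetical. It counts start-tags for the elements listed above that are not already written as empty-element tags. It does not check for a separate end-tag, so <br></br> is still counted.

```python
from collections import Counter
from html.parser import HTMLParser

# The 12 classic empty elements plus the proprietary ones listed above.
CLASSIC_EMPTY = {"br", "hr", "meta", "link", "base", "img",
                 "embed", "param", "area", "frame", "col", "input"}
PROPRIETARY_EMPTY = {"basefont", "bgsound", "keygen", "sound", "spacer", "wbr"}
ALL_EMPTY = CLASSIC_EMPTY | PROPRIETARY_EMPTY


class EmptyTagCounter(HTMLParser):
    """Count empty elements that are not written with /> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.unclosed = Counter()

    def handle_starttag(self, tag, attrs):
        # Called for tags such as <br> or <img src="...">.
        if tag in ALL_EMPTY:
            self.unclosed[tag] += 1

    def handle_startendtag(self, tag, attrs):
        # Called for tags such as <br /> that already use empty-element syntax.
        pass


def count_unclosed_empty_tags(html: str) -> Counter:
    parser = EmptyTagCounter()
    parser.feed(html)
    parser.close()
    return parser.unclosed


sample = '<p>a<br>b<hr><br class="empty" /><img src="x.png"></p>'
print(dict(count_unclosed_empty_tags(sample)))
```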
Although XML and XHTML allow these elements to be written either with a start-tag/end-tag pair such as <br></br> or with an empty-element tag such as <br />, the latter is much friendlier to older browsers and to human authors. There's little reason not to prefer the empty-element tag.
However, even an empty-element tag such as <br/> can confuse some older browsers that actually read this as an unknown element with the name br/ instead of the known element br. Maximum compatibility is achieved if you add an attribute and a space before the final slash. The class attribute is a good choice. For example:
<br class="empty" /> <hr class="empty" />
I picked empty as the class to be clear why I inserted it. However, the value of the class attribute really doesn't matter. If you have reason to assign a different class to some or all of these elements, feel free.
TagSoup and Tidy will convert these elements as part of their fixup. However, neither adds the class="empty" attributes. You can add those with an extra search and replace step at the end, or you can just make the entire change with search and replace. I would start with the br element. You can simply search for all <br> tags and replace them with <br class="empty" />.
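The simple search and replace for bare <br> tags might look like the following in Python. This is only a sketch: the function name convert_plain_br is hypothetical, and it assumes tag names have already been lowercased, so <BR> is left alone.

```python
def convert_plain_br(html: str) -> str:
    """Replace every bare <br> tag with <br class="empty" /> (hypothetical helper).

    Assumes lowercase tag names; tags with attributes are not touched.
    """
    return html.replace("<br>", '<br class="empty" />')


before = ("Polonius<hr> You shall do marvelous wisely, good Reynaldo,<br> "
          "Before you visit him, to make inquire<br> Of his behavior.<br>")
after = convert_plain_br(before)
print(after)
```

Only the br tags change in this run; the <hr> tag is left for its own pass.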
However, there are a few things to watch out for. The first is whether someone has already done this. Check to see whether there are any </br> end-tags in the source. If any appear, first remove them, as they're no longer necessary.
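Checking for stray </br> end-tags and removing them can be done in one step. The sketch below assumes Python's standard re module; the function name strip_br_end_tags is hypothetical. It returns the number of end-tags removed so that you can see whether any were present.

```python
import re

# Matches </br> and variants with whitespace before the closing bracket.
BR_END_TAG = re.compile(r"</br\s*>")


def strip_br_end_tags(html: str) -> tuple[str, int]:
    """Remove </br> end-tags; return the cleaned text and the number removed."""
    cleaned, removed = BR_END_TAG.subn("", html)
    return cleaned, removed


cleaned, removed = strip_br_end_tags("Of his behavior.<br></br>")
print(removed, cleaned)
```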
The remaining concern is br tags with attributes, such as <br clear="all">. You can find these by searching for "<br" followed by a space; ignore any matches that already end with />.
If there aren't too many of these, I might just open the files and fix them manually. If there are a lot of them, you can automate the process, but this will require a slightly more complicated regular expression. The search expression is:
<br\s+([^>]*)=([^>]+[^/])>
The replace expression is:
<br \1=\2 />
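As an illustration, here is how that search and replace pair could be applied with Python's re module, which accepts the same \1 and \2 backreferences. The function name close_br_with_attributes is hypothetical. Because the pattern requires that the character before > is not a slash, tags that already end with /> are left unchanged.

```python
import re

# The search expression from the text, unchanged.
BR_WITH_ATTRIBUTES = re.compile(r"<br\s+([^>]*)=([^>]+[^/])>")


def close_br_with_attributes(html: str) -> str:
    """Rewrite <br attr="value"> as <br attr="value" /> (hypothetical helper)."""
    return BR_WITH_ATTRIBUTES.sub(r"<br \1=\2 />", html)


print(close_br_with_attributes('Line one<br clear="all">Line two'))
print(close_br_with_attributes('Already done<br class="empty" />'))
```

As with any bulk edit, it is safer to try this on a copy of the files first.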
When you're done, run your test suite to make sure all is well and you haven't accidentally broken something.
The hr element is handled almost identically. The meta and link elements are trickier because they almost always have attributes, so you need to use the more complicated regular expression, with the element name changed accordingly. Of course, Tidy and TagSoup are also options.
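To reuse the same two steps for hr, meta, link, and the other empty elements, the element name can be made a parameter. The following is a rough sketch in Python rather than a tested tool; close_empty_tags is a hypothetical name, and the pattern is simply the br expression with the name substituted.

```python
import re


def close_empty_tags(html: str, name: str) -> str:
    """Convert <name ...> tags to empty-element syntax (hypothetical helper)."""
    # Tags with attributes: the br pattern with the element name swapped in.
    with_attributes = re.compile(rf"<{re.escape(name)}\s+([^>]*)=([^>]+[^/])>")
    html = with_attributes.sub(rf"<{name} \1=\2 />", html)
    # Bare tags: add the class attribute for the sake of older browsers.
    return html.replace(f"<{name}>", f'<{name} class="empty" />')


page = '<meta name="author" content="x"><link rel="stylesheet" href="a/b.css"><hr>'
for element in ("meta", "link", "hr"):
    page = close_empty_tags(page, element)
print(page)
```

Run your test suite after each element, as before, so that a bad replacement is caught early.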