- What Is Well-Formedness?
- Change Name to Lowercase
- Quote Attribute Value
- Fill In Omitted Attribute Value
- Replace Empty Tag with Empty-Element Tag
- Add End-tag
- Remove Overlap
- Convert Text to UTF-8
- Escape Less-Than Sign
- Escape Ampersand
- Escape Quotation Marks in Attribute Values
- Introduce an XHTML DOCTYPE Declaration
- Terminate Each Entity Reference
- Replace Imaginary Entity References
- Introduce a Root Element
- Introduce the XHTML Namespace
Add End-tag
Close all paragraphs, list items, table cells, and other nonempty elements.
It is intended to include all the industries of the United States concerned in French trade under the following classifications:<p>
<ol>
  <li>Machine-Tools, Wire, Transmission and Textiles
  <li>Milling Machinery
  <li>Electrical Apparatus
  <li>Transportation
  <li>Importers
  <li>Synthetic Products based on chemical processes
  <li>Bankers
  <li>Factory Architects, Engineers and Contractors
</ol>
<p>It is intended to include all the industries of the United States concerned in French trade under the following classifications:</p>
<ol>
  <li>Machine-Tools, Wire, Transmission and Textiles</li>
  <li>Milling Machinery</li>
  <li>Electrical Apparatus</li>
  <li>Transportation</li>
  <li>Importers</li>
  <li>Synthetic Products based on chemical processes</li>
  <li>Bankers</li>
  <li>Factory Architects, Engineers and Contractors</li>
</ol>
Motivation
The first motivation is simply XML compatibility. XML parsers require that each start-tag be matched by a corresponding end-tag.
However, there's a strong additional reason. Many documents do not display as intended in classic HTML when the end-tags are omitted. The problem is not that the browsers do not know how or where to insert end-tags. It's that authors often do not arrange the tags properly. All too often, the boundaries of an unclosed HTML element do not fall where the author expects. The result can be a document that appears quite different from what is expected. Indentation problems are the most common symptom (elements are not indented that should be, or elements are indented too far). However, all sorts of display problems can result. CSS is extremely hard to create and debug in the face of improperly closed elements.
Potential Trade-offs
Few and minimal. The resultant documents may be slightly larger. If you're not serving gigabytes per day, this is not worth worrying about.
Mechanics
Manually, you simply need to inspect each file and determine where the end-tags belong. For example, consider this table modeled after one in the HTML 4 specification:
<table>
  <tr>
    <th rowspan="2">
    <th colspan="2">Average
    <th rowspan="2">Blond Hair
  <tr><th>Height<th>Weight
  <tr><th>Boys<td>1.4<td>58<td>28%
  <tr><th>Girls<td>1.3<td>34.5<td>17%
</table>
Only the </table> end-tag is present. All the other end-tags are implied. A browser can probably figure this out. A human author might not and is likely to insert new content in the wrong place. Add end-tags after each element, like so:
<table>
  <tr>
    <th rowspan="2"></th>
    <th colspan="2">Average</th>
    <th rowspan="2">Blond Hair</th>
  </tr>
  <tr>
    <th>Height</th>
    <th>Weight</th>
  </tr>
  <tr>
    <th>Boys</th>
    <td>1.4</td>
    <td>58</td>
    <td>28%</td>
  </tr>
  <tr>
    <th>Girls</th>
    <td>1.3</td>
    <td>34.5</td>
    <td>17%</td>
  </tr>
</table>
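One way to confirm that no end-tag was missed is to hand the edited markup to an XML parser, since XML parsers reject any start-tag without a matching end-tag. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name is_well_formed and the two sample strings are illustrative assumptions, not part of any tool discussed here, and the check applies only to fragments with a single root element and no HTML-only entity references.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise.

    Hypothetical helper for illustration; it assumes the fragment
    has exactly one root element.
    """
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# A row in the original style: only </table> is present.
unclosed = "<table><tr><th>Boys<td>1.4<td>58<td>28%</table>"

# The same row with every end-tag added.
closed = (
    "<table><tr><th>Boys</th><td>1.4</td>"
    "<td>58</td><td>28%</td></tr></table>"
)

print(is_well_formed(unclosed))  # False
print(is_well_formed(closed))    # True
```

A check like this reports only whether the markup parses. It cannot tell you whether each end-tag sits where the author intended, so it complements manual inspection rather than replacing it.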
Paragraphs are worth special attention here. When paragraph end-tags are omitted, the <p> start-tag usually serves as an end-tag rather than a start-tag: authors place it after each paragraph as a separator. You'll commonly see content such as this tidbit from Through the Looking-Glass:
Alice didn't like this idea at all: so, to change the subject, she asked 'Does she ever come out here?'
<p>
'I daresay you'll see her soon,' said the Rose. 'She's one of the thorny kind.'
<p>
'Where does she wear the thorns?' Alice asked with some curiosity.
<p>
When encountering text such as this, you'll want to turn each <p> into a </p>, and then add the missing start-tags like so:
<p>Alice didn't like this idea at all: so, to change the subject, she asked 'Does she ever come out here?' </p>
<p>'I daresay you'll see her soon,' said the Rose. 'She's one of the thorny kind.' </p>
<p>'Where does she wear the thorns?' Alice asked with some curiosity. </p>
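For a run of text in which <p> is used purely as a separator, the manual transformation can be approximated in a few lines of Python. This is a rough sketch under strong assumptions: the input contains only bare <p> tags (no attributes, no existing </p> end-tags) and no other block-level elements. The function name wrap_paragraphs and the sample text are hypothetical, chosen only for illustration.

```python
import re


def wrap_paragraphs(text: str) -> str:
    """Treat each bare <p> as a separator and wrap every nonempty
    chunk of text in a <p>...</p> pair.

    Assumes no attributes on <p>, no existing </p> tags, and no
    other block-level elements in the input.
    """
    chunks = re.split(r"<p>", text, flags=re.IGNORECASE)
    paragraphs = [f"<p>{chunk.strip()}</p>" for chunk in chunks if chunk.strip()]
    return "\n".join(paragraphs)


sample = "First block of text. <p> Second block. <p> Third block. <p>"
print(wrap_paragraphs(sample))
# <p>First block of text.</p>
# <p>Second block.</p>
# <p>Third block.</p>
```

Because the trailing <p> leaves only an empty chunk, no empty paragraph is produced at the end. Real pages rarely satisfy these assumptions throughout, so review the result by hand.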
Tidy and TagSoup can add the missing end-tags automatically. However, they usually guess the location of the start-tag incorrectly. TagSoup, for instance, produces markup such as this:
Alice didn't like this idea at all: so, to change the subject, she asked 'Does she ever come out here?'
<p> 'I daresay you'll see her soon,' said the Rose. 'She's one of the thorny kind.' </p>
<p>'Where does she wear the thorns?' Alice asked with some curiosity. </p>
<p> </p>
Tidy doesn't add the closing empty paragraph, but it still fails to find the start of the first paragraph. You can tell Tidy to wrap paragraphs around orphan text blocks using the --enclose-block-text option with the value y:
$ tidy -asxhtml --enclose-block-text y endtag.html
A misplaced paragraph start-tag doesn't matter for basic browser display, but it matters a great deal if you've assigned any specific CSS style rules to the p element. Furthermore, it can cause special formatting intended for the first paragraph of a chapter or section to be applied to the second instead.
Usually this happens only to the first paragraph in a section. However, if a run of paragraphs is interrupted by a div, table, blockquote, or other block-level element, there is likely to be an unwrapped block of text after each such element.
Consequently, after running TagSoup over a page, search for empty paragraphs. Anytime you find one, it means there's probably a paragraph-less block of text earlier in the document that you should enclose in a new p element. However, this is tricky because often the start-tag and end-tag are on different lines. The following regular expression will find most occurrences:
<p>\s*</p>
This expression will find any empty paragraphs that have attributes:
<p\s[^>]*>\s*</p>
However, such paragraphs weren't created by Tidy or TagSoup, so you'll probably want to leave them in.
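The two expressions above can be applied from a short script. The sketch below uses Python's re module; the function name find_empty_paragraphs and the sample page are illustrative assumptions. Because \s also matches newlines, the search must run over the whole file contents rather than line by line, which is how it catches start-tags and end-tags that sit on different lines.

```python
import re

# The two expressions from the text, compiled unchanged.
EMPTY_P = re.compile(r"<p>\s*</p>")
EMPTY_P_WITH_ATTRS = re.compile(r"<p\s[^>]*>\s*</p>")


def find_empty_paragraphs(html: str) -> list[tuple[int, str]]:
    """Return (line number, matched text) for each attribute-free
    empty paragraph in the given document text."""
    hits = []
    for match in EMPTY_P.finditer(html):
        # Count newlines before the match to get a 1-based line number.
        line = html.count("\n", 0, match.start()) + 1
        hits.append((line, match.group(0)))
    return hits


page = '<p>Text</p>\n<p>\n</p>\n<p class="spacer"> </p>\n'

print(find_empty_paragraphs(page))
# [(2, '<p>\n</p>')]
print(EMPTY_P_WITH_ATTRS.findall(page))
# ['<p class="spacer"> </p>']
```

Each reported line number marks the empty paragraph itself. The unwrapped block of text that needs a new p element lies somewhere before it in the document.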