- What Is Well-Formedness?
- Change Name to Lowercase
- Quote Attribute Value
- Fill In Omitted Attribute Value
- Replace Empty Tag with Empty-Element Tag
- Add End-tag
- Remove Overlap
- Convert Text to UTF-8
- Escape Less-Than Sign
- Escape Ampersand
- Escape Quotation Marks in Attribute Values
- Introduce an XHTML DOCTYPE Declaration
- Terminate Each Entity Reference
- Replace Imaginary Entity References
- Introduce a Root Element
- Introduce the XHTML Namespace
Remove Overlap
Close every element within its parent element.
This is <strong><em>very important</strong></em>!
<p>Sarah answered, <q>I'm really not sure about this.</p>
<p>Maybe you should ask somebody else?</q> Then she sat down.</p>

This is <strong><em>very important</em></strong>!
<p>Sarah answered, <q>I'm really not sure about this.</q></p>
<p><q>Maybe you should ask somebody else?</q> Then she sat down.</p>
Motivation
Different browsers do not build the same trees from documents containing overlapping elements. Consequently, the same JavaScript can behave very differently from one browser to the next.
Furthermore, a small change to a document with overlap can radically change the DOM tree that even a single browser builds. Consequently, JavaScript built on top of such documents is fragile, and CSS is likewise fragile. JavaScript, CSS, and other programs that read a document's DOM are hard to create, debug, and maintain in the face of overlapping elements.
Potential Trade-offs
Sometimes the nature of the text really does call for overlap, as when a quote begins in one paragraph and ends in another. This comes up frequently in Biblical scholarship, for instance. Not all text fits neatly into a tree.
Unfortunately, HTML, XML, and XHTML cannot handle overlap in any reasonable fashion. If you're doing scholarly textual analysis, you may need something more powerful still. However, this is rarely a concern for simple web publication. You can usually hack around the problem well enough for browser display by using more elements than may logically be called for.
Mechanics
A validator will report all areas where overlap is a problem. However, overlap is so confusing to tools that they may not diagnose it properly or in an obvious fashion. Different validators will report problems in different locations, and a single validator may report several errors related to one occurrence. Sometimes the problem will be indicated as an unclosed element or an end-tag without a start-tag, or both. For example:
overlap.html:10: parser error : Opening and ending tag mismatch: q line 10 and p
<p>Sarah answered, <q>I'm really not sure about this.</p>
                                                         ^
overlap.html:11: parser error : Opening and ending tag mismatch: p line 11 and q
<p>Maybe you should ask somebody else?</q> Then she
Furthermore, an overlap problem may cause a parser to miss the starts or ends of other elements, and it may not be able to recover. It is very common for overlap to cause a cascading sequence of progressively more serious errors for the rest of the document. Thus, you should start at the beginning and fix one error at a time. Often, fixing an overlap problem eliminates many other error messages.
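One way to work through errors one at a time is to let a strict XML parser stop at the first problem, fix it, and run the check again. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The function name `first_wellformedness_error` and the sample strings are illustrative, and the sketch assumes the page is already close enough to XHTML for an XML parser to read it (for instance, it has a single root element and no HTML-only named entities).

```python
import xml.etree.ElementTree as ET


def first_wellformedness_error(markup):
    """Return (line, column, message) for the first error, or None if well-formed."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        line, column = err.position  # where the parser gave up
        return line, column, str(err)
    return None


# Sample fragments wrapped in a hypothetical <div> root element.
overlapping = (
    "<div>\n"
    "<p>Sarah answered, <q>I'm really not sure about this.</p>\n"
    "<p>Maybe you should ask somebody else?</q> Then she sat down.</p>\n"
    "</div>"
)
repaired = (
    "<div>\n"
    "<p>Sarah answered, <q>I'm really not sure about this.</q></p>\n"
    "<p><q>Maybe you should ask somebody else?</q> Then she sat down.</p>\n"
    "</div>"
)

print(first_wellformedness_error(overlapping))  # a mismatched-tag error on line 2
print(first_wellformedness_error(repaired))     # None
```

Because the parser stops at the first error, later messages that are only side effects of the first overlap never appear, which fits the fix-one-then-recheck approach.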
Repairing overlap is not hard. Sometimes the overlap is trivial, as when the end-tag for the parent element immediately precedes the end-tag for the child element. Then you just have to swap the end-tags. For example, change this:
<strong><em>very important</strong></em>
to this:
<strong><em>very important</em></strong>
If the overlap extends into another element, you close the overlapping element inside its first parent and reopen it in the last. For example, suppose you have these two paragraphs containing one quote:
<p>Sarah answered, <q>I'm really not sure about this.</p> <p>Maybe you should ask somebody else?</q> Then she sat down.</p>
Change them to two paragraphs, each containing a quote:
<p>Sarah answered, <q>I'm really not sure about this.</q> </p> <p> <q>Maybe you should ask somebody else?</q> Then she sat down. </p>
If there are intervening elements, you'll need to create new elements inside those as well.
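The close-and-reopen procedure can be sketched as a small stack-based rewriter. This is only an illustration built on Python's standard `html.parser` module, not a replacement for careful hand editing. The names `OverlapRepairer` and `repair_overlap` are hypothetical, and the sketch assumes that entity references are already terminated and that the markup contains no processing instructions. An element that is closed early is reopened inside the next element that starts, or around the next non-whitespace text, whichever comes first.

```python
from html.parser import HTMLParser

# Elements that never take an end-tag, so they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class OverlapRepairer(HTMLParser):
    """Hypothetical helper: rewrites overlapping tags as properly nested ones."""

    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep entity references as written
        self.out = []      # pieces of repaired markup
        self.stack = []    # (tag, start-tag text) for each open element
        self.pending = []  # elements closed early, waiting to be reopened

    def reopen_pending(self):
        for tag, text in self.pending:
            self.out.append(text)
            self.stack.append((tag, text))
        self.pending = []

    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        self.out.append(text)
        if tag not in VOID:
            self.stack.append((tag, text))
            self.reopen_pending()  # reopen inside the new parent

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # <br/> style: nothing to track

    def handle_endtag(self, tag):
        names = [name for name, _ in self.stack]
        if tag in names:
            # Close everything opened after this element, innermost first.
            index = len(names) - 1 - names[::-1].index(tag)
            closed_early = self.stack[index + 1:]
            for name, _ in reversed(closed_early):
                self.out.append(f"</{name}>")
            self.out.append(f"</{tag}>")
            del self.stack[index:]
            self.pending = closed_early + self.pending
        else:
            # End-tag of an element that was already closed early, or a stray: drop it.
            for i, (name, _) in enumerate(self.pending):
                if name == tag:
                    del self.pending[i]
                    break

    def handle_data(self, data):
        if data.strip():
            self.reopen_pending()  # real text still belongs to the pending element
        self.out.append(data)

    def handle_entityref(self, name):
        self.reopen_pending()
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.reopen_pending()
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def repair_overlap(markup):
    parser = OverlapRepairer()
    parser.feed(markup)
    parser.close()
    # Close anything the author never closed, innermost first.
    for name, _ in reversed(parser.stack):
        parser.out.append(f"</{name}>")
    return "".join(parser.out)


print(repair_overlap("This is <strong><em>very important</strong></em>!"))
print(repair_overlap(
    "<p>Sarah answered, <q>I'm really not sure about this.</p>\n"
    "<p>Maybe you should ask somebody else?</q> Then she sat down.</p>"
))
```

Whitespace between the two paragraphs is deliberately left outside the reopened quote, so the sketch does not wrap the paragraph boundary in an empty `q` element. Whether a reopened element really belongs inside the next element is still an editorial judgment, so review the output rather than trusting it blindly.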
Tidy and TagSoup can fix technical overlap problems, but not especially well, and the result is often not what you would expect. For example, Tidy does not always reopen an overlapping element inside the next element. It turns this:
<p>Sarah answered, <q>I'm really not sure about this.</p> <p>Maybe you should ask somebody else?</q> Then she sat down.</p>
into this:
<p>Sarah answered, <q>I'm really not sure about this.</q></p> <p>Maybe you should ask somebody else? Then she sat down.</p>
It completely loses the quote in the second paragraph. TagSoup keeps the quote in the second paragraph but introduces a quote around the boundary whitespace between the two paragraphs:
<p>Sarah answered, <q>I'm really not sure about this.</q></p> <q> </q> <p><q>Maybe you should ask somebody else?</q> Then she sat down.</p>
Consequently, I prefer to fix these overlap problems by hand if there aren't too many of them. You're more likely to reproduce the original intent that way.
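To judge whether there are few enough overlaps to fix by hand, it can help to list them first. The sketch below is one possible approach using Python's standard `html.parser` module. The names `OverlapFinder` and `find_overlaps` are hypothetical. It assumes that omitted end-tags have already been added, because otherwise every unclosed `p` or `li` would be reported as an overlap.

```python
from html.parser import HTMLParser

# Elements that never take an end-tag, so they are never tracked.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class OverlapFinder(HTMLParser):
    """Hypothetical helper: records end-tags that close a non-innermost element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # names of the currently open elements
        self.overlaps = []  # (line of end-tag, its name, elements still open inside it)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # <br/> style tags open nothing

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end-tag, usually the tail of an overlap already recorded
        # Find the innermost open element with this name.
        index = len(self.stack) - 1 - self.stack[::-1].index(tag)
        if index < len(self.stack) - 1:
            line = self.getpos()[0]  # line on which this end-tag starts
            self.overlaps.append((line, tag, self.stack[index + 1:]))
        del self.stack[index:]


def find_overlaps(markup):
    finder = OverlapFinder()
    finder.feed(markup)
    finder.close()
    return finder.overlaps


sample = (
    "<p>Sarah answered, <q>I'm really not sure about this.</p>\n"
    "<p>Maybe you should ask somebody else?</q> Then she sat down.</p>"
)
for line, tag, still_open in find_overlaps(sample):
    print(f"line {line}: </{tag}> closes while {still_open} is still open")
```

Each reported line is a place where an element must be closed before its parent ends and, if the content continues, reopened afterward.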