Using WordProcessingML to Generate Clean HTML from Word
- Microsoft Word as Authoring Tool
- Setting Up the Infrastructure
- Basic Transformation
- Transforming Paragraphs
- Formatting Text Ranges
- Summary
Microsoft Word as Authoring Tool
If your organization isn’t fully committed to open-source software, it’s highly likely that your staff uses Microsoft Word as one of its authoring tools. Microsoft Word is no doubt a great tool for authoring, commenting on, annotating, and reviewing text, but when you need to adjust Word-authored content to fit the design requirements of your web site, you might be in deep trouble.
Microsoft Word has been able to generate HTML output from source documents for quite a while. However, even in the Office 2003 release, the HTML output generated by Word is extremely verbose. The "Web Page, Filtered" save option improves the situation, and the resulting document can be optimized further with a variety of external tools (covered in Laurie Rowell’s excellent article "Clean HTML from Word: Can It Be Done?"), but the fact remains that the HTML generated by Microsoft Word is not XHTML-compliant, or even HTML 4.0-compliant. For example, the output has no <!DOCTYPE> declaration, and empty elements are not closed the way XHTML requires (for instance, Word generates <BR> tags instead of <BR />). A list of the other idiosyncrasies would probably exceed the length of this article: paragraph borders are turned into DIVs wrapped around the paragraphs, bullets are converted into non-breaking spaces, and so on.
Word-to-HTML cleanup utilities focus on tidying the HTML mess generated by Word, but don’t go beyond that stage. If your web standards differ even slightly from your Word templates, these utilities can’t help you. For example, if you use style names such as HA, HB, and HC for your Word headings and <H1>, <H2>, and <H3> tags on your web site, none of the Word cleanup utilities will be of much help. You’ll fare even worse if you want to include other document properties from Word in your HTML output (intranet sites might need the author, reviewer, and approved-by names and the corresponding timestamps attached to the document text). As a result, many of us write custom hacks in a variety of scripting languages (I had to do a few of them in Perl and VBScript) to adjust the Word output to the needs of a corporate web site.
While busy reinventing the wheel, most of us have missed a nice opportunity to use a standards-based solution: Word in Office 2003 Professional Edition can save documents in a pure XML format (called WordProcessingML) that follows well-documented, Microsoft-defined XML schemas. These XML files contain all the information stored in a Word document, in a standard, easy-to-parse format. With the XML source in hand, we can use a variety of tools to generate the target HTML markup. I’ll show you how to use XSL transformations (XSLT) to do the job.
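To give a feel for the approach, the heading problem mentioned earlier (Word styles HA, HB, and HC that should become <H1>, <H2>, and <H3>) can be expressed in a few XSLT templates. What follows is only a minimal sketch under stated assumptions, not the stylesheet developed in the rest of this article: it assumes the document was saved as Word 2003 XML, that the custom styles carry the style IDs HA, HB, and HC, and that plain paragraph text is all you need. Character formatting, tables, lists, and hyperlinks are ignored.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: maps three assumed paragraph style IDs (HA, HB, HC)
     to h1/h2/h3 and every other paragraph to p. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">

  <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

  <!-- Start at the root and visit only paragraphs inside the document body,
       so document properties, fonts, and style definitions are skipped. -->
  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates select="//w:body//w:p"/>
      </body>
    </html>
  </xsl:template>

  <!-- Paragraphs whose style ID is HA, HB, or HC become headings.
       The style IDs are assumptions taken from the example in the text. -->
  <xsl:template match="w:p[w:pPr/w:pStyle/@w:val='HA']">
    <h1><xsl:apply-templates select="w:r/w:t"/></h1>
  </xsl:template>

  <xsl:template match="w:p[w:pPr/w:pStyle/@w:val='HB']">
    <h2><xsl:apply-templates select="w:r/w:t"/></h2>
  </xsl:template>

  <xsl:template match="w:p[w:pPr/w:pStyle/@w:val='HC']">
    <h3><xsl:apply-templates select="w:r/w:t"/></h3>
  </xsl:template>

  <!-- Any other paragraph becomes a plain p element. The templates above
       win for styled paragraphs because a match pattern with a predicate
       has a higher default priority than a bare element name. -->
  <xsl:template match="w:p">
    <p><xsl:apply-templates select="w:r/w:t"/></p>
  </xsl:template>

</xsl:stylesheet>
```

Any XSLT 1.0 processor should be able to apply a stylesheet like this to a document saved in the Word XML format. The point of the sketch is that the mapping between Word styles and web site markup lives in one small, declarative place instead of a custom cleanup script.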