Basic Transformation
We’ll start this project by developing a simple WordProcessingML-to-HTML transformation. It will only be able to convert the Word text into HTML paragraphs (<P> tags) with no additional formatting or markup. Even text within tables will be transformed into plain paragraphs.
The XSL document starts with the xsl:stylesheet tag, which must declare every namespace used by WordProcessingML so that we can match individual WordProcessingML tags during the transformation (see Listing 1).
Listing 1 Start the XSL stylesheet with namespace definitions.
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
  xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
  xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  exclude-result-prefixes="w sl aml wx o">
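To see what these namespace prefixes are matched against, it helps to have a small input document at hand. The following is a hand-written, heavily trimmed sketch of a Word 2003 XML file; the title, author, and paragraph text are made-up sample values, and a document saved by Word contains many more elements (fonts, styles, section wrappers) than are shown here.

```xml
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:o="urn:schemas-microsoft-com:office:office">
  <!-- Document properties live outside w:body -->
  <o:DocumentProperties>
    <o:Title>Sample document</o:Title>
    <o:Author>Jane Doe</o:Author>
  </o:DocumentProperties>
  <w:body>
    <!-- One paragraph containing a single run -->
    <w:p>
      <w:r>
        <w:t>First paragraph.</w:t>
      </w:r>
    </w:p>
    <!-- One paragraph split into two runs; the second run is bold -->
    <w:p>
      <w:r>
        <w:t>Second paragraph, </w:t>
      </w:r>
      <w:r>
        <w:rPr><w:b/></w:rPr>
        <w:t>split across two runs.</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
```

A file like this is enough to exercise a stylesheet in a standalone XSLT processor before trying real Word output.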
The HTML framework (the required HTML output tags) is generated by matching the w:body tag. We could also match the w:wordDocument tag, but focusing on the w:body tag makes the stylesheet easier to adapt to Office 2007 documents. (Office 2007 saves the document body in a separate file within the .docx ZIP archive. In the released format, w:body is the child of that file's w:document root element, and the w prefix maps to a different namespace URI, so the namespace declaration has to be changed accordingly.) Listing 2 shows the corresponding XSL code, together with the xsl:output element defining the HTML DOCTYPE.
Listing 2 Generate the HTML framework.
<xsl:output method="html" encoding="windows-1250"
  doctype-public="-//W3C//DTD HTML 4.01//EN"
  doctype-system="http://www.w3.org/TR/html4/strict.dtd"/>

<xsl:template match="w:body">
  <html>
    <head>
      <xsl:if test="//o:DocumentProperties/o:Title">
        <title>
          <xsl:value-of select="//o:DocumentProperties/o:Title" />
        </title>
      </xsl:if>
    </head>
    <body>
      <xsl:apply-templates />
    </body>
  </html>
</xsl:template>
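One detail worth noting: HTML 4.01 requires a title element in every document, while Listing 2 emits none when the Word file has no title property. A possible variant, offered only as a sketch, always writes a title and falls back to placeholder text. The variable name title and the text "Untitled document" are assumptions made for this illustration, and the rules for w:p and text nodes are left out so that only the framework is shown.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
  xmlns:o="urn:schemas-microsoft-com:office:office"
  exclude-result-prefixes="w o">

  <xsl:output method="html" encoding="windows-1250"
    doctype-public="-//W3C//DTD HTML 4.01//EN"
    doctype-system="http://www.w3.org/TR/html4/strict.dtd"/>

  <!-- Title with surrounding whitespace trimmed;
       an empty string if the property is missing or blank -->
  <xsl:variable name="title"
    select="normalize-space(//o:DocumentProperties/o:Title)"/>

  <xsl:template match="w:body">
    <html>
      <head>
        <title>
          <xsl:choose>
            <xsl:when test="$title != ''">
              <xsl:value-of select="$title"/>
            </xsl:when>
            <!-- Placeholder text (hypothetical) for documents without a title -->
            <xsl:otherwise>Untitled document</xsl:otherwise>
          </xsl:choose>
        </title>
      </head>
      <body>
        <xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

  <!-- Paragraph and text rules intentionally omitted from this sketch -->
</xsl:stylesheet>
```

Computing the title once in a top-level variable also avoids evaluating the same // expression twice.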
As the basic transformation will only process paragraphs (w:p tags), we need to do two things:
- Match the w:p (Word paragraph) tags within the w:body tag and create HTML <P> tags for them.
- Match the w:t tags (text fragments within runs, the w:r tags) and output their text.
We also need to match any other text nodes in the XML tree explicitly—and ignore them. Otherwise, we’d get extraneous information from document properties and binary Word field data mixed into our text. All three rules are displayed in Listing 3, and the complete transformation stylesheet can be downloaded from my web site.
Listing 3 Translate the w:p tags into paragraphs.
<xsl:template match="w:p[ancestor::w:body]">
  <p><xsl:apply-templates /></p>
</xsl:template>

<xsl:template match="w:t/text()">
  <xsl:value-of select="." />
</xsl:template>

<xsl:template match="text()" />
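The need for the empty text() rule follows from the built-in template rules of XSLT 1.0, which apply whenever no explicit template matches a node. Written out as an ordinary stylesheet, the two relevant built-in rules look roughly like the sketch below (the built-in rules for modes, comments, and processing instructions are omitted). A stylesheet with no text() rule of its own therefore copies every text node in the input, including document properties, to the output.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Built-in rule for the root and for elements:
       emit nothing, but keep descending into the children -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Built-in rule for text and attribute nodes:
       copy the string value to the output -->
  <xsl:template match="text()|@*">
    <xsl:value-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

The two text rules in Listing 3 do not conflict. Under the XSLT 1.0 default priorities, the multi-step pattern w:t/text() has priority 0.5 and the bare text() pattern has priority -0.5, so text inside w:t is output and all other text is dropped.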
After you have tested this transformation with a standalone XSLT processor, you can use it straight from Word:
- Choose File > Save As. In the Save As dialog box, open the Save As Type drop-down list and select XML Document.
- Select the Apply Transform option.
- Click Transform and specify your XSL stylesheet as the transformation file (see Figure 1).
- Click Save. Word generates a WordProcessingML document and transforms it with your XSL document into the final HTML file. The only glitch is that the resulting file has an .xml extension.
Figure 1 Perform the XSL transformation within Word.