Microsoft Office 2003's big claim to fame -- at least as far as I was concerned -- was the addition of support for XML. As far as Word 2003 is concerned, this support falls into two categories: the ability to edit XML documents, and the ability to save an actual Word document in WordProcessingML (AKA WordML), which enables you to preserve all of the formatting and other information you need to recreate the Word document later.
It's this second capability we're interested in now.
In earlier versions of Word, if you wanted to analyze the content of a Word document, you could add styles, save it as HTML, and then jump through about a dozen flaming hoops to turn that HTML into XML you could then analyze. (Can you tell I've been there, done that?)
Word 2003 makes it easier, in that you don't have to kill yourself to get well-focused XML, but unless you understand the thinking behind it, WordML can seem impossibly complex. Fortunately, once you do understand the general thinking, it's pretty straightforward, so let's go ahead and take a look at it.
The first thing you have to understand is the overall hierarchy of objects. It works like this: A document is made up of paragraphs. A paragraph is made up of text runs. A text run is made up of pieces of text. (There are other objects, of course, such as drawings and lists, but we'll worry about them in future columns.) That means we can create a very simple document:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r>
        <w:t>WordML -- XML in Microsoft Word 2003</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
Notice that in the body of the document, we've got a paragraph that contains a single text run, which contains a single string of text.
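To make the document, paragraph, run, text hierarchy concrete, here is a minimal Python sketch that walks it using the standard library's ElementTree. The function name paragraphs_as_text and the SAMPLE constant are made up for this illustration; only the element names and the namespace come from the listing above.

```python
import xml.etree.ElementTree as ET

# WordML 2003 namespace, copied from the listing above.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# The simple one-paragraph document from the listing above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r>
        <w:t>WordML -- XML in Microsoft Word 2003</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def paragraphs_as_text(xml_text):
    """Return one string per w:p, joining the w:t text of each of its runs."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    result = []
    for p in root.findall("w:body/w:p", NS):          # document -> paragraphs
        pieces = [t.text or "" for t in p.findall("w:r/w:t", NS)]  # runs -> text
        result.append("".join(pieces))
    return result


if __name__ == "__main__":
    print(paragraphs_as_text(SAMPLE))
```

The sketch only looks at paragraphs that sit directly in the body, which is all this simple document has; drawings, lists, and other objects would need their own handling.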
Notice also the use of the namespace and processing instruction. The latter tells Microsoft Windows to use Word to open the document; the former tells Word how to treat it. If you have Word 2003, you can use it to open the document, as you can see in Figure 1.
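If you're generating or receiving these files from a program, it can be handy to check that both pieces are present before handing the file to Word. The following is a rough sketch using Python's xml.dom.minidom, which keeps document-level processing instructions; the helper names word_progid and root_namespace are hypothetical.

```python
from xml.dom import minidom

# The simple document from earlier, as bytes so the encoding declaration is honored.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:r>
        <w:t>WordML -- XML in Microsoft Word 2003</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def word_progid(xml_bytes):
    """Return the data of the mso-application processing instruction, or None."""
    doc = minidom.parseString(xml_bytes)
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "mso-application"):
            return node.data
    return None


def root_namespace(xml_bytes):
    """Return the namespace URI of the root element."""
    return minidom.parseString(xml_bytes).documentElement.namespaceURI


if __name__ == "__main__":
    print(word_progid(SAMPLE))
    print(root_namespace(SAMPLE))
```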
Of course, this is of limited use in the real world, where there's formatting (and more than one paragraph, as a general rule) involved. For example, we'd like this title to be big, bold, and centered. To do that, we can add style information to the paragraph:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:pPr>
        <w:spacing w:before="240" w:after="60"/>
        <w:jc w:val="center"/>
      </w:pPr>
      <w:r>
        <w:t>WordML -- XML in Microsoft Word 2003</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
That centers the text, as in Figure 2:
In this case, we're dealing with a "paragraph" style -- in other words, you can apply it to one word, but it will affect the whole paragraph -- so we added it using the pPr element. By adding it at the start of the paragraph, we're specifying that the information takes effect for the whole paragraph. You can also add "character" style information to text runs using the rPr element. For example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:pPr>
        <w:spacing w:before="240" w:after="60"/>
        <w:jc w:val="center"/>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rFonts w:ascii="Arial" w:h-ansi="Arial" w:cs="Arial"/>
          <w:b/>
          <w:kern w:val="32"/>
          <w:sz w:val="32"/>
        </w:rPr>
        <w:t>WordML -- XML in Microsoft Word 2003</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>
You can see the results in Figure 3:
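The split between paragraph-level pPr and run-level rPr is easy to see from the reading side, too. Here's a small sketch, again in Python with ElementTree, that reports the alignment of each paragraph and the bold flag and size value of each run. The function name describe_formatting and the shape of the returned dictionaries are my own choices for illustration, and the size is returned as the raw attribute string rather than being interpreted.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}
VAL = "{%s}val" % W  # attributes such as w:val are namespaced as well

# The centered, bold title document from the listing above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:pPr>
        <w:spacing w:before="240" w:after="60"/>
        <w:jc w:val="center"/>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rFonts w:ascii="Arial" w:h-ansi="Arial" w:cs="Arial"/>
          <w:b/>
          <w:kern w:val="32"/>
          <w:sz w:val="32"/>
        </w:rPr>
        <w:t>WordML -- XML in Microsoft Word 2003</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def describe_formatting(xml_text):
    """Summarize paragraph alignment (from pPr) and run formatting (from rPr)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    report = []
    for p in root.findall("w:body/w:p", NS):
        jc = p.find("w:pPr/w:jc", NS)             # paragraph-level property
        alignment = jc.get(VAL) if jc is not None else None
        runs = []
        for r in p.findall("w:r", NS):
            bold = r.find("w:rPr/w:b", NS) is not None   # run-level property
            sz = r.find("w:rPr/w:sz", NS)
            size = sz.get(VAL) if sz is not None else None
            text = "".join(t.text or "" for t in r.findall("w:t", NS))
            runs.append({"text": text, "bold": bold, "size": size})
        report.append({"alignment": alignment, "runs": runs})
    return report


if __name__ == "__main__":
    print(describe_formatting(SAMPLE))
```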
Now, strictly speaking, you can do any formatting you need this way, but there isn't any way to understand what any of the content represents. For example, there's no way to know the headline is a headline, rather than simply a large block of text. Of course, WordML is part of a word processing program, so if there was ever a time to favor presentation over structure, this is it. But that doesn't mean we have to settle for it.
No, instead, we can create styles and apply them to specific parts of the document. For example, we have three different styles in this document:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:styles>
    <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
      <w:name w:val="Normal"/>
      <w:rPr>
        <w:sz w:val="24"/>
      </w:rPr>
    </w:style>
    <w:style w:type="paragraph" w:styleId="CenteredHeading1">
      <w:name w:val="Centered Heading One"/>
      <w:basedOn w:val="Normal"/>
      <w:next w:val="Normal"/>
      <w:pPr>
        <w:spacing w:before="240" w:after="60"/>
        <w:jc w:val="center"/>
      </w:pPr>
      <w:rPr>
        <w:rFonts w:ascii="Arial" w:h-ansi="Arial" w:cs="Arial"/>
        <w:b/>
        <w:kern w:val="32"/>
        <w:sz w:val="32"/>
      </w:rPr>
    </w:style>
    <w:style w:type="character" w:styleId="Callout">
      <w:name w:val="Callout"/>
      <w:basedOn w:val="Normal"/>
      <w:next w:val="Normal"/>
      <w:rPr>
        <w:b/>
        <w:i/>
      </w:rPr>
    </w:style>
  </w:styles>
  <w:body>
    <w:p>
      <w:pPr>
        <w:pStyle w:val="CenteredHeading1"/>
      </w:pPr>
      <w:r>
        <w:t>WordML -- XML in Microsoft Word 2003</w:t>
      </w:r>
    </w:p>
    <w:p/>
    <w:p>
      <w:r>
        <w:t>Microsoft Office 2003's big claim to fame -- at least as far as I was concerned -- was the addition of support for XML. As far as Word 2003 is concerned, this support falls into two categories: the ability to edit XML documents, and the ability to save an actual Word Document in </w:t>
      </w:r>
      <w:r>
        <w:rPr>
          <w:i/>
        </w:rPr>
        <w:t>Word Processing ML</w:t>
      </w:r>
      <w:r>
        <w:t>, which enables you to preserve all of the formatting and other information you need to recreate the Word document later.</w:t>
      </w:r>
    </w:p>
    <w:p/>
    <w:p>
      <w:pPr>
        <w:jc w:val="center"/>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="Callout"/>
        </w:rPr>
        <w:t>It's this second capability we're interested in now.</w:t>
      </w:r>
    </w:p>
    <w:p/>
    <w:p>
      <w:r>
        <w:t>In earlier versions of Word, if you wanted to analyze the content of a Word document, you could add styles, save it as HTML, and then jump through about a dozen flaming hoops to turn that HTML into XML you could then analyze. (Can you tell I've been there, done that?)</w:t>
      </w:r>
    </w:p>
    <w:p/>
  </w:body>
</w:wordDocument>
We've got several items of note here. First, notice that you identify a style two ways. The name is what appears in the "style" pulldown in Word, while the styleId is what you use to actually assign styles to content, as you can see in Figure 4:
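To see the two identifiers side by side, here's a rough Python sketch that builds a lookup from each styleId to its display name, then resolves the style of each paragraph through it. The SAMPLE below is a trimmed-down version of the listing above (the formatting properties are left out), and the function names style_names and paragraph_style_names are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}
VAL = "{%s}val" % W
STYLE_ID = "{%s}styleId" % W

# Trimmed version of the styled document: style definitions plus one paragraph.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:styles>
    <w:style w:type="paragraph" w:default="on" w:styleId="Normal">
      <w:name w:val="Normal"/>
    </w:style>
    <w:style w:type="paragraph" w:styleId="CenteredHeading1">
      <w:name w:val="Centered Heading One"/>
    </w:style>
    <w:style w:type="character" w:styleId="Callout">
      <w:name w:val="Callout"/>
    </w:style>
  </w:styles>
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="CenteredHeading1"/></w:pPr>
      <w:r><w:t>WordML -- XML in Microsoft Word 2003</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def style_names(root):
    """Map each styleId (used in the markup) to its name (shown in Word)."""
    names = {}
    for style in root.findall("w:styles/w:style", NS):
        name = style.find("w:name", NS)
        names[style.get(STYLE_ID)] = name.get(VAL) if name is not None else None
    return names


def paragraph_style_names(root):
    """Return (styleId, display name) for each paragraph in the body."""
    names = style_names(root)
    result = []
    for p in root.findall("w:body/w:p", NS):
        p_style = p.find("w:pPr/w:pStyle", NS)
        style_id = p_style.get(VAL) if p_style is not None else None
        result.append((style_id, names.get(style_id)))
    return result


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(style_names(root))
    print(paragraph_style_names(root))
```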
The advantage of setting styles this way is that you can go in later with an XSLT style sheet and convert the presentation information into structure information, no flaming hoops required.
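As a taste of what that conversion involves, here's a minimal sketch of the same idea in Python rather than XSLT (a stand-in, so the logic can be run directly). It maps paragraph styles to structural elements. The output element names article, title, and para, the STRUCTURE mapping, and the function name to_structure are all assumptions made for this example; a real conversion would choose its own vocabulary and would also need to handle character styles such as Callout.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}
VAL = "{%s}val" % W

# Hypothetical mapping from a paragraph styleId to an output element name.
STRUCTURE = {"CenteredHeading1": "title"}

# Trimmed version of the styled document's body.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="CenteredHeading1"/></w:pPr>
      <w:r><w:t>WordML -- XML in Microsoft Word 2003</w:t></w:r>
    </w:p>
    <w:p/>
    <w:p>
      <w:r><w:t>Word 2003 can save a document as </w:t></w:r>
      <w:r><w:rPr><w:i/></w:rPr><w:t>Word Processing ML</w:t></w:r>
      <w:r><w:t>.</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def to_structure(xml_text):
    """Turn styled WordML paragraphs into a small structural XML document."""
    root = ET.fromstring(xml_text)
    article = ET.Element("article")
    for p in root.findall("w:body/w:p", NS):
        text = "".join(t.text or "" for t in p.findall("w:r/w:t", NS))
        if not text:
            continue  # skip empty spacer paragraphs such as <w:p/>
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(VAL) if style is not None else None
        # Paragraphs with no mapped style fall back to a plain <para>.
        child = ET.SubElement(article, STRUCTURE.get(style_id, "para"))
        child.text = text
    return ET.tostring(article, encoding="unicode")


if __name__ == "__main__":
    print(to_structure(SAMPLE))
```

Because the heading is identified by its styleId rather than by its font size or boldness, changing how the style looks in Word doesn't affect the conversion.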