- Sams Teach Yourself XML in 21 Days, Third Edition
Creating XML Documents Piece by Piece
Yesterday, you created this example XML document:
<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>
That's a fully functional XML document, but it's only an example. Today, we're going to be more systematic about what goes into an XML document, discussing all the possible parts of such documents. You'll take a look at these parts of an XML document in the coming sections:
- Prologs
- XML declarations
- Processing instructions
- Elements and attributes
- Comments
- CDATA sections
- Entities
W3C defines everything that can go into XML documents in the XML 1.0 and 1.1 specifications, right down to our starting point—the character set you use.
Character Encodings: ASCII, Unicode, and UCS
The characters in an XML document are stored using numeric codes. That can be an issue, because different character sets use different codes, which means an XML processor might have problems trying to read an XML document that uses a character set—called a character encoding—other than what it's used to.
For example, a common character encoding used by text editors is the American Standard Code for Information Interchange (ASCII). ASCII is the default for plain text files created with Windows WordPad. Standard ASCII codes extend from 0 to 127, and the 8-bit extensions commonly grouped with ASCII run up to 255. For example, the ASCII code for A is 65, for B is 66, and so on. So, if you stored the word CAT in an XML document written in ASCII, the numbers 67, 65, and 84 are what would actually be stored (lowercase cat would be stored as 99, 97, and 116). On the other hand, the World Wide Web is just that: worldwide. Plenty of character sets can't fit into the 256 characters of an 8-bit encoding, such as Cyrillic, Armenian, Hebrew, Thai, Tibetan, and so on.
For that reason, W3C turned to Unicode (http://www.unicode.org), which in its original 16-bit form holds 65,536 characters, not just 256 (although only about 40,000 Unicode codes are reserved at this point). To make things easier, the first 128 Unicode characters correspond to the ASCII character set, and the first 256 correspond to ISO-8859-1 (Latin-1).
There's another character encoding available that has even more space than Unicode: the Universal Character Set (UCS, also called ISO 10646) uses 32 bits, or four bytes, per character. This gives it a range of two billion symbols, and a good thing, too, since there are more Chinese characters alone than there is space in 16-bit Unicode. UCS also encompasses the smaller Unicode character set. Each Unicode character is represented by the same code in UCS, in much the same way that Unicode encompasses the smaller ASCII character set.
So which character sets are supported in XML? ASCII? Unicode? UCS? Unicode uses two bytes for each character, so a Unicode file would be twice as long as an ASCII file. For that and other reasons, it's difficult to convert much of the available software to Unicode. XML actually supports a compressed version of Unicode created by the UCS group called UCS Transformation Format-8 (UTF-8). UTF-8 includes all the ASCII codes unchanged, and stores each of them in a single byte. Any other Unicode characters need more than one byte (up to six in the original definition of UTF-8). For example, the Unicode for π is 03C0 in hexadecimal (960 in decimal), which you need to store in two bytes.
To make it easier to handle, UCS itself has also been compressed in the same way into a character set named UTF-16, which uses two bytes (instead of the normal four that UCS uses) for the most common characters, and more bytes for the less common characters.
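To make the byte counts above concrete, here is a minimal sketch in Python (a choice made for illustration only; the variable names are hypothetical). It looks up the numeric codes for a word and compares how π is stored in UTF-8 and in UTF-16:

```python
# Numeric codes stored for a word in ASCII (one code per character).
upper_codes = list("CAT".encode("ascii"))
lower_codes = list("cat".encode("ascii"))

pi = "\u03c0"                       # GREEK SMALL LETTER PI
pi_code = ord(pi)                   # the Unicode code for pi, as an integer
pi_utf8 = pi.encode("utf-8")        # UTF-8 needs more than one byte here
pi_utf16 = pi.encode("utf-16-be")   # big-endian UTF-16 without a byte-order mark

# Plain ASCII text is byte-for-byte identical when stored as UTF-8.
ascii_is_utf8 = "cat".encode("ascii") == "cat".encode("utf-8")

print(upper_codes, lower_codes)
print(pi_code, hex(pi_code))
print(pi_utf8.hex(), len(pi_utf8))
print(pi_utf16.hex(), len(pi_utf16))
print(ascii_is_utf8)
```

The last comparison is the reason an ASCII file can be handed to an XML processor as UTF-8 without conversion.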
W3C requires all XML processors to support both UTF-8 (compressed Unicode, including the full ASCII set), and UTF-16 (compressed UCS, including the full ASCII set), and those are the only two W3C requires. The UTF-8 encoding is the most popular one today in XML documents, because you can store documents in ASCII using a text editor and they can be treated, without any changes, as UTF-8 by an XML processor (ASCII uses one byte for characters, and UTF-8 uses one byte for the most common characters, including all the characters in the ASCII set). In fact, we've been using UTF-8 since our first XML example, as you can see where we've specified the character encoding for a document with the encoding attribute in the XML declaration:
<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>
UTF-8 is so widespread that an XML processor will assume you're using it if you omit the encoding attribute. Although W3C requires all XML processors to support UTF-16 and UTF-8 (so you can assign these values to the encoding attribute), most don't support UTF-16 yet.
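If you'd like to check how a particular processor handles encodings, a small experiment helps. The sketch below assumes Python's standard xml.etree.ElementTree module as the XML processor (any other processor may behave differently); the element name and sample text are made up for the test. It parses the same content once as UTF-8 bytes with no XML declaration at all, and once as UTF-16 with a matching declaration:

```python
import xml.etree.ElementTree as ET

content = "\u03c0 is about 3.14159"

# No XML declaration at all: the parser should fall back to UTF-8.
utf8_bytes = f"<message>{content}</message>".encode("utf-8")
utf8_text = ET.fromstring(utf8_bytes).text

# The same content stored as UTF-16. Python's "utf-16" codec writes a
# byte-order mark first, which the parser can use to detect the encoding.
utf16_source = f'<?xml version="1.0" encoding="UTF-16"?><message>{content}</message>'
utf16_bytes = utf16_source.encode("utf-16")
utf16_text = ET.fromstring(utf16_bytes).text

print(utf8_text == utf16_text)
print(len(utf8_bytes), len(utf16_bytes))
```

Both parses should hand back the same text, even though the two byte sequences on disk look nothing alike.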
Although only UTF-8 and UTF-16 are required, there are many character encodings that an XML processor can support, such as the following:
- US-ASCII— U.S. ASCII
- UTF-8— Compressed Unicode
- UTF-16— Compressed UCS
- ISO-10646-UCS-2— Unicode
- ISO-10646-UCS-4— UCS
- ISO-2022-JP— Japanese
- ISO-2022-CN— Chinese
- ISO-8859-5— ASCII and Cyrillic
The increasing adoption of Unicode is the main driving force behind XML 1.1. There are three main areas in which XML 1.1 differs from XML 1.0, all having to do with characters:
- XML 1.1 accepts more Unicode characters than were available when XML 1.0 was created. (XML 1.0 was created when Unicode version 2.0 was current; now version 4.0 is being tested.)
- XML 1.1 relaxes some rules for creating names (as used for elements and attributes) to allow more Unicode characters, and to allow for Unicode expansion in the future.
- XML 1.1 permits more legal characters you can use to end a line.
You'll see these various points in more depth today. However, note that most of these differences are technical, and won't concern you a great deal. For example, XML 1.0 and 1.1 differ slightly in what character references you can use. As in HTML, a character reference stands for a Unicode character and begins with &#, followed by a numeric code specifying a character, and ends with ;. You can either enter a Unicode character in an XML document as the character itself or as a character reference, which the XML processor will convert into the corresponding character.
For example, the Unicode for π is 960 in decimal, so you can embed π in your XML document by entering π (if your text editor supports Unicode), or as the character reference &#960; (if your text editor doesn't support Unicode). The XML processor will replace the character reference with π. (You can also give the Unicode in hexadecimal if you preface it with an x, which would be &#x03C0; in this case.)
The difference between XML 1.0 and XML 1.1 as far as character references go is that XML 1.1 allows the use of character references &#x1; through &#x1F;, most of which are forbidden in XML 1.0. Conversely, the characters &#x7F; through &#x9F;, which were allowed as characters or character references in XML 1.0 documents, may only appear as character references in XML 1.1. These kinds of relatively small differences aren't going to concern us a great deal. For all these details, check the XML 1.1 candidate recommendation itself.
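Character references are easy to try out. This is a minimal sketch, again assuming Python's standard xml.etree.ElementTree module, which is an XML 1.0 parser; the symbol element is invented for the example. It resolves the decimal and hexadecimal references for π, and then tries a control-character reference of the kind XML 1.0 forbids:

```python
import xml.etree.ElementTree as ET

# Decimal and hexadecimal references to the same character.
decimal_ref = ET.fromstring("<symbol>&#960;</symbol>").text
hex_ref = ET.fromstring("<symbol>&#x03C0;</symbol>").text

# A reference to control character 1, which an XML 1.0 parser should reject.
try:
    ET.fromstring("<symbol>&#x1;</symbol>")
    control_ref_accepted = True
except ET.ParseError:
    control_ref_accepted = False

print(decimal_ref == hex_ref)
print(control_ref_accepted)
```

A processor that implements XML 1.1 would be expected to accept the last reference in a document declared as version 1.1; that case isn't covered here.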
That's given us a handle on the character encodings you can use to create XML documents. The next step is to see just how you put those characters to work in XML as you create markup and text data.
Understanding XML Markup and XML Data
At their most basic level, XML documents are combinations of markup and text data. They might also include binary data one day, but there's no way to include binary data in an XML document at the moment. (If you want to associate binary data with an XML document, you keep that data external to the document and use an entity reference, as you'll see later today and in Day 5 in detail.)
The markup in a document gives it its structure. Markup includes start tags, end tags, empty element tags, entity references, character references, comments, CDATA section delimiters (more about CDATA sections in a few pages), document type declarations, and processing instructions. What about the data in an XML document? All the text in an XML document that is not markup is data.
Although the markup we've seen has mostly consisted of tags up to this point, there's another type of markup that doesn't use tags: general entity references and parameter entity references. Whereas tags begin with < and end with >, general entity references start with & and end with ; (as with the character references we've already seen; for example, &#960; is a character reference for π). General entity references are replaced by the entity they refer to when the document is parsed. Parameter entity references, which start with % and end with ;, are used in DTDs, as we'll see in Days 4 and 5.
For example, the markup &lt; is a general entity reference that is turned into a < (less than) symbol when parsed by an XML processor, and the general entity reference &gt; is turned into a > (greater than) symbol in the same way. You can see an example using these general entity references in Example 2.1.
Example 2.1. Using an Entity Reference (ch02_01.xml)
<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This text is inside a &lt;message&gt; element.
    </message>
</document>
You can see ch02_01.xml in Internet Explorer in Figure 2.9. As you can see in the figure, the markup &lt; was turned into a <, and the markup &gt; was turned into a > by the XML processor.
Figure 2.9 Using markup in Internet Explorer.
Besides character entity references, where a character code is replaced by the character it stands for, there are five predefined general entity references in XML, which are used when browsers might otherwise assume that they're part of markup to be interpreted:
- &lt;: Replaced with <
- &gt;: Replaced with >
- &amp;: Replaced with &
- &quot;: Replaced with "
- &apos;: Replaced with '
You can also create your own general entity references, which we'll do in Day 5.
When an XML processor parses your XML, it replaces general entity references like &gt; with the entity those references stand for, which is > in this case. Before it's parsed, text data is called character data; after it's been parsed and general entity references have been replaced with the entities they refer to, the text data is called parsed character data.
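You rarely have to write the predefined entity references by hand. As a sketch, assuming Python's standard library (xml.sax.saxutils.escape and xml.etree.ElementTree; the sample sentence is made up), you can convert raw text into character data with the references in place, and then let the parser turn it back into parsed character data:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = 'This text is inside a <message> element & uses "quotes".'

# escape() replaces &, <, and > by default; the extra mapping adds the
# two quote characters so all five predefined references are covered.
escaped = escape(raw_text, {'"': "&quot;", "'": "&apos;"})

# Parsing the escaped text replaces each reference with its character.
parsed_text = ET.fromstring(f"<message>{escaped}</message>").text

print(escaped)
print(parsed_text == raw_text)
```

Escaping the ampersand first matters: otherwise the & inside a freshly inserted &lt; would itself be escaped a second time.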
Using Whitespace and Ends of Lines
Spaces, carriage returns, line feeds, and tabs are all treated as whitespace in XML. That means that to an XML processor, this XML document:
<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>
is the same as this one, in terms of content:
<?xml version="1.0" encoding="UTF-8"?>
<document><heading>Hello From XML</heading>
<message>This is an XML document!</message></document>
You can use a special attribute named xml:space in an element to indicate that you want whitespace to be preserved by XML processors (not all XML processors will support this attribute). You can set this attribute to "default" to indicate that the default handling of whitespace is OK for the current element and all contained elements, or "preserve" to indicate that you want all applications to preserve whitespace as it is in the document. This is useful if the XML processor is going to display the XML document visually:
<?xml version="1.0" encoding="UTF-8"?>
<document xml:space="preserve">
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>
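How much of this whitespace an application actually sees depends on the processor. The following sketch assumes Python's standard xml.etree.ElementTree module, which passes the whitespace inside elements through to the application and leaves it to your code to decide what to do with it; the helper function element_texts is hypothetical. It compares an indented document with a compact one:

```python
import xml.etree.ElementTree as ET

indented = """<?xml version="1.0" encoding="UTF-8"?>
<document>
    <heading>
        Hello From XML
    </heading>
    <message>
        This is an XML document!
    </message>
</document>"""

compact = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<document><heading>Hello From XML</heading>"
    "<message>This is an XML document!</message></document>"
)

def element_texts(xml_source):
    """Return the text of each child of the document element, unmodified."""
    root = ET.fromstring(xml_source.encode("utf-8"))
    return [child.text for child in root]

raw_indented = element_texts(indented)
raw_compact = element_texts(compact)

# Trimming leading and trailing whitespace makes the two documents match.
trimmed_indented = [text.strip() for text in raw_indented]

print(raw_indented)
print(raw_compact)
print(trimmed_indented == raw_compact)
```

With this parser, the two documents carry the same content only after the application trims the text, which is the kind of decision xml:space is meant to communicate.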
In XML 1.0, lines officially end with a linefeed character (ASCII and UTF-8 code 10, the Unix way of ending lines). In MS-DOS and some Windows programs, lines can end with a carriage return (ASCII and UTF-8 code 13) linefeed pair, but when parsed by an XML processor, that pair (codes 13 and 10) is converted into a single linefeed (ASCII and UTF-8 code 10). In XML 1.1, which is mostly about expanding the character sets you can use, XML 1.0 was considered to discriminate against the conventions used on IBM and IBM-compatible mainframes. That means that in XML 1.1, the acceptable line endings that XML processors are supposed to convert to &#xA; are expanded to include the following:
- The two-character sequence &#xD; &#xA;
- The two-character sequence &#xD; &#x85; (&#x85; is the New Line (NEL) character in many mainframes.)
- The single character &#x85;
- The single character &#x2028; (This is the Unicode line separator character.)
- Any &#xD; character not immediately followed by &#xA; or &#x85;.
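You can watch end-of-line conversion happen with a short test. This sketch assumes Python's standard xml.etree.ElementTree module, which implements XML 1.0, so it only exercises the carriage return and linefeed cases; the XML 1.1 additions such as NEL are not covered. The message element and its text are invented for the example:

```python
import xml.etree.ElementTree as ET

# The same two-line text with three different line-ending conventions.
crlf_text = ET.fromstring(b"<message>line one\r\nline two</message>").text
cr_text = ET.fromstring(b"<message>line one\rline two</message>").text
lf_text = ET.fromstring(b"<message>line one\nline two</message>").text

# After parsing, each variant should contain a single linefeed (code 10).
line_break_codes = [ord(text[len("line one")]) for text in (crlf_text, cr_text, lf_text)]

print(line_break_codes)
print(crlf_text == cr_text == lf_text)
```

Because the conversion happens inside the parser, your own code only ever has to handle one kind of line ending.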
That brings us up through the basic structure of an XML document—markup and data. Now it's time to actually start putting markup and data to work as you start creating XML documents.