Reading and Writing RSS Feeds
Today you work with Extensible Markup Language (XML), a formatting standard that enables data to be completely portable.
You’ll explore XML in the following ways:
- Representing data as XML
- Discovering why XML is a useful way to store data
- Using XML to publish web content
- Reading and writing XML data
The XML format employed throughout the day is Really Simple Syndication (RSS), a popular format, adopted by millions of sites, for publishing web content and sharing information about site updates.
Using XML
One of Java’s main selling points is that the language produces programs that can run on different operating systems without modification. The portability of software is a big convenience in today’s computing environment, where Windows, Linux, Mac OS, and a half dozen other operating systems are in wide use and many people work with multiple systems.
XML, which stands for Extensible Markup Language, is a format for storing and organizing data that is independent of any software program that works with the data.
Data that is compliant with XML is easier to reuse for several reasons.
First, the data is structured in a standard way, making it possible for software programs to read and write the data as long as they support XML. If you create an XML file that represents your company’s employee database, there are several dozen XML parsers that can read the file and make sense of its contents.
This is true no matter what kind of information you collect about each employee. If your database contains only the employee’s name, ID number, and current salary, XML parsers can read it. If it contains 25 items, including birthday, blood type, and hair color, parsers can read that, too.
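As a minimal sketch of this point, the following class uses the DOM parser bundled with the JDK to read employee records without knowing in advance which fields they contain. The employees, employee, name, id, salary, and bloodType element names are hypothetical, invented only for illustration, as are the EmployeeReader class and its methods.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class EmployeeReader {
    // Hypothetical employee database; the second record carries an extra field.
    static final String SAMPLE =
        "<employees>"
        + "<employee><name>Ada</name><id>17</id><salary>52000</salary></employee>"
        + "<employee><name>Sam</name><id>18</id><salary>48000</salary>"
        + "<bloodType>O+</bloodType></employee>"
        + "</employees>";

    // Returns one map per <employee>, keyed by whatever child element names appear.
    static List<Map<String, String>> readEmployees(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<Map<String, String>> employees = new ArrayList<>();
            NodeList records = doc.getElementsByTagName("employee");
            for (int i = 0; i < records.getLength(); i++) {
                Map<String, String> fields = new LinkedHashMap<>();
                NodeList children = records.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child instanceof Element) { // skip text nodes between elements
                        fields.put(child.getNodeName(), child.getTextContent());
                    }
                }
                employees.add(fields);
            }
            return employees;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (Map<String, String> employee : readEmployees(SAMPLE)) {
            System.out.println(employee);
        }
    }
}
```

The reading code never names a field. Adding birthday or hair color to a record changes the data, not the program.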
Second, the data is self-documenting, making it easier for people to understand the purpose of a file just by looking at it in a text editor. Anyone who opens your XML employee database should be able to figure out the structure and content of each employee record without any assistance from you.
This is evident in Listing 19.1, which contains an RSS file. Because RSS is an XML dialect, it is structured under the rules of XML.
Listing 19.1. The Full Text of workbench.rss
 1: <?xml version="1.0" encoding="utf-8"?>
 2: <rss version="2.0">
 3:   <channel>
 4:     <title>Workbench</title>
 5:     <link>http://www.cadenhead.org/workbench/</link>
 6:     <description>Programming, publishing, politics, and popes</description>
 7:     <docs>http://www.rssboard.org/rss-specification</docs>
 8:     <item>
 9:       <title>Toronto Star: Only 100 Blogs Make Money</title>
10:       <link>http://www.cadenhead.org/workbench/news/3132</link>
11:       <pubDate>Mon, 26 Feb 2007 11:30:57 -0500</pubDate>
12:       <guid isPermaLink="false">tag:cadenhead.org,2007:weblog.3132</guid>
13:       <enclosure length="2498623" type="audio/mpeg"
14:         url="http://mp3.cadenhead.org/3132.mp3" />
15:     </item>
16:     <item>
17:       <title>Eliot Spitzer Files UDRP to Take EliotSpitzer.Com</title>
18:       <link>http://www.cadenhead.org/workbench/news/3130</link>
19:       <pubDate>Thu, 22 Feb 2007 18:02:53 -0500</pubDate>
20:       <guid isPermaLink="false">tag:cadenhead.org,2007:weblog.3130</guid>
21:     </item>
22:     <item>
23:       <title>Fuzzy Zoeller Sues Over Libelous Wikipedia Page</title>
24:       <link>http://www.cadenhead.org/workbench/news/3129</link>
25:       <pubDate>Thu, 22 Feb 2007 13:48:45 -0500</pubDate>
26:       <guid isPermaLink="false">tag:cadenhead.org,2007:weblog.3129</guid>
27:     </item>
28:   </channel>
29: </rss>
Enter this text using a word processor or text editor and save it as plain text under the name workbench.rss. (You can also download a copy of it from the book’s website at http://www.java21days.com on the Day 19 page.)
Can you tell what the data represents? Although the ?xml tag at the top might be indecipherable, the rest is clearly a website database of some kind.
The ?xml tag in the first line of the file, formally called the XML declaration, has a version attribute with a value of "1.0" and an encoding attribute with a value of "utf-8". This establishes that the file follows the rules of XML 1.0 and is encoded in the UTF-8 character encoding.
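A program can read these two values back after parsing. The sketch below assumes the JDK's built-in DOM parser and a shortened, hypothetical copy of the feed. DeclarationReader and its parse() helper are names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class DeclarationReader {
    // Shortened stand-in for workbench.rss, kept inline so the example is self-contained.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
        + "<rss version=\"2.0\"><channel><title>Workbench</title></channel></rss>";

    // Parses the text as UTF-8 bytes, the same way a parser would see a saved file.
    static Document parse(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("XML version: " + doc.getXmlVersion());
        System.out.println("Declared encoding: " + doc.getXmlEncoding());
    }
}
```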
Data in XML is surrounded by tag elements that describe the data. Opening tags begin with a “<” character followed by the name of the tag and a “>” character. Closing tags begin with the “</” characters followed by a name and a “>” character. In Listing 19.1, for example, <item> on line 8 is an opening tag, and </item> on line 15 is a closing tag. Everything within those tags is considered to be the value of that element.
Elements can be nested within other elements, creating a hierarchy of XML data that establishes relationships within that data. In Listing 19.1, everything in lines 9–14 is related; each element defines something about the same website item.
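The hierarchy becomes visible when a program walks it. As a rough sketch, the class below parses a trimmed copy of the feed with the JDK's DOM parser and prints each element indented by its nesting depth, followed by its value when it contains only text. ElementWalker and its outline() methods are hypothetical helpers written for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ElementWalker {
    // Trimmed copy of the feed: one channel title and one item.
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel><title>Workbench</title>"
        + "<item><title>Toronto Star: Only 100 Blogs Make Money</title>"
        + "<link>http://www.cadenhead.org/workbench/news/3132</link></item>"
        + "</channel></rss>";

    // Parses the XML and returns an indented outline of its elements.
    static String outline(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            StringBuilder out = new StringBuilder();
            outline(root, 0, out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Appends one line per element, indented two spaces per nesting level.
    static void outline(Element element, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(element.getTagName());
        List<Element> childElements = new ArrayList<>();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                childElements.add((Element) children.item(i));
            }
        }
        if (childElements.isEmpty()) {
            // An element with no child elements: show its text value.
            out.append(": ").append(element.getTextContent().trim());
        }
        out.append('\n');
        for (Element child : childElements) {
            outline(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(SAMPLE));
    }
}
```

In the output, the item's title and link sit one level below item, which sits below channel. That mirrors how the elements are nested in the file.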
Elements also can include attributes, which are made up of data that supplements the rest of the data associated with the element. Attributes are defined within an opening tag element. The name of an attribute is followed by an equal sign and text within quotation marks.
In line 12 of Listing 19.1, the guid element includes an isPermaLink attribute with a value of "false". This indicates that the element’s value, “tag:cadenhead.org,2007:weblog.3132”, is not a permalink (the URL at which the item can be loaded in a browser).
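Reading an attribute from Java might look like the following sketch, which uses the JDK's DOM parser on a single item copied from the listing. GuidReader, firstGuid(), and isPermaLink() are illustrative names, not part of any library. Treating a missing attribute as true follows the RSS 2.0 specification, which makes isPermaLink optional with a default of true.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class GuidReader {
    // One item from the feed, reduced to its guid element.
    static final String SAMPLE =
        "<item><guid isPermaLink=\"false\">"
        + "tag:cadenhead.org,2007:weblog.3132</guid></item>";

    // Returns the first <guid> element found in the XML text.
    static Element firstGuid(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName("guid").item(0);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // getAttribute() returns an empty string when the attribute is absent,
    // so only an explicit "false" marks the guid as not being a permalink.
    static boolean isPermaLink(Element guid) {
        return !"false".equals(guid.getAttribute("isPermaLink"));
    }

    public static void main(String[] args) {
        Element guid = firstGuid(SAMPLE);
        System.out.println("Value: " + guid.getTextContent());
        System.out.println("Permalink: " + isPermaLink(guid));
    }
}
```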
XML also supports elements defined by a single tag rather than a pair of tags. The tag begins with a “<” character followed by the name of the tag and ends with the “/>” characters. The RSS file includes an enclosure element in lines 13–14 that describes an MP3 audio file associated with the item.
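Because a single-tag element has no content, everything it describes lives in its attributes. The sketch below, again assuming the JDK's DOM parser, reads the enclosure from the listing and also parses the equivalent paired form for comparison. EnclosureReader and parseRoot() are hypothetical names used only here.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class EnclosureReader {
    // The enclosure as written in the listing, using a single tag.
    static final String SHORT_FORM =
        "<enclosure length=\"2498623\" type=\"audio/mpeg\" "
        + "url=\"http://mp3.cadenhead.org/3132.mp3\" />";

    // The same element written as an opening and closing tag with nothing between.
    static final String PAIRED_FORM =
        "<enclosure length=\"2498623\" type=\"audio/mpeg\" "
        + "url=\"http://mp3.cadenhead.org/3132.mp3\"></enclosure>";

    // Parses the XML text and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element enclosure = parseRoot(SHORT_FORM);
        long bytes = Long.parseLong(enclosure.getAttribute("length"));
        System.out.println(enclosure.getAttribute("url")
            + " (" + enclosure.getAttribute("type") + ", " + bytes + " bytes)");
        // Neither form has any child nodes, so both report no content.
        System.out.println("Short form has content: " + enclosure.hasChildNodes());
        System.out.println("Paired form has content: "
            + parseRoot(PAIRED_FORM).hasChildNodes());
    }
}
```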
XML encourages the creation of data that’s understandable and usable even if the user doesn’t have the program that created it and cannot find any documentation that describes it.
The purpose of the RSS file in Listing 19.1 can be understood, for the most part, simply by looking at it. Each item represents a web page that has been updated recently.
Data that follows XML’s formatting rules is said to be well-formed. Any software that can work with XML reads and writes well-formed XML data.
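One way to test whether text is well-formed is to hand it to a parser and see whether parsing succeeds. The following sketch does this with the JDK's DOM parser. WellFormedCheck and isWellFormed() are names invented for the example, and the check says nothing about whether the data is a correct RSS feed, only whether it obeys XML's syntax rules.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the text parses as XML, false if the parser rejects it.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser found a violation of XML's formatting rules.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A matched pair of tags.
        System.out.println(isWellFormed("<item><title>Workbench</title></item>"));
        // The title element is never closed.
        System.out.println(isWellFormed("<item><title>Workbench</item>"));
    }
}
```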