Program Specs
The steps previously described have been implemented in a small toolkit written in Python. Here are some details (a rough sketch of the pipeline follows this list):
- RSS feeds are grabbed directly from within the tool using the Python URL library (urllib).
- RSS parsing is performed using the Python XML toolkit.
- Stemming uses the Porter stemming algorithm.
- The output was fed to the neato program from Graphviz.
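As a non-authoritative illustration, here is a minimal sketch of what such a pipeline might look like in modern Python (the original toolkit ran on Python 2.3). The feed URL, the two-shared-terms threshold, and the use of NLTK's PorterStemmer are assumptions made for the example, not details taken from the toolkit:

```python
# A rough sketch of the fetch/parse/stem/graph pipeline, not the original toolkit.
import urllib.request
import xml.dom.minidom
from itertools import combinations

from nltk.stem import PorterStemmer  # stand-in implementation of Porter stemming

FEEDS = ["http://example.com/world.rss"]  # hypothetical feed list
stemmer = PorterStemmer()

def fetch_items(url):
    """Fetch one RSS feed and return (title, description) pairs."""
    data = urllib.request.urlopen(url).read()
    doc = xml.dom.minidom.parseString(data)
    items = []
    for item in doc.getElementsByTagName("item"):
        def text(tag):
            # Simplified: take the first text node of the first matching element.
            nodes = item.getElementsByTagName(tag)
            node = nodes[0].firstChild if nodes else None
            return node.data if node is not None else ""
        items.append((text("title"), text("description")))
    return items

def stems(headline):
    """Lowercase, tokenize, and stem a headline into a set of terms."""
    words = (w.strip(".,;:!?\"'") for w in headline.lower().split())
    return {stemmer.stem(w) for w in words if w}

def write_dot(stories, path="clusters.dot", threshold=2):
    """Write an undirected graph for neato: an edge joins two stories whose
    stemmed headlines share at least `threshold` terms (threshold is illustrative)."""
    with open(path, "w") as out:
        out.write("graph stories {\n")
        for (i, a), (j, b) in combinations(enumerate(stories), 2):
            if len(stems(a[0]) & stems(b[0])) >= threshold:
                out.write('  "%d" -- "%d";\n' % (i, j))
        out.write("}\n")

if __name__ == "__main__":
    stories = [item for url in FEEDS for item in fetch_items(url)]
    write_dot(stories)  # then: neato -Tsvg clusters.dot -o clusters.svg
```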
More than 60 RSS feeds from several dozen world news sites were processed using this technique. Of 1,154 unique stories, 458 were correlated to other stories within this data set. The clustering that resulted from this process is shown in Figure 4. The process, including fetching every URL in the feed list, took approximately two minutes on a 1.4 GHz Pentium 4 running Python 2.3. Processing with neato took approximately three minutes.
Figure 4 Sample output of RSS clustering with input data from 66 sources (NYT, BBC, AP, UPI, etc.). This data was gathered on 31 August 2004. The clustering output was processed using the neato tool from the Graphviz toolkit.
The headline at the center of each cluster is the story with the most detailed description, not necessarily the one closest to all of the other stories. With Adobe's SVG viewer, which supports Linux, Windows, and OS X, the user can right-click to open the SVG contextual menu and then zoom or pan the image.
At this point, no metrics are included to evaluate the quality of the groupings, which makes it difficult to measure the impact of any attempted improvements to the approach.
A more practical use of the mechanism is shown in the next couple of figures. I've used these methods to develop a personal news site called Monkey News, built from several dozen news feeds from around the world. Because of the structure of a typical news feed, subjects are clearly indicated: a typical news headline usually includes a subject and an action, and often a geographic location. Most daily newspapers follow this format for regular stories, since a headline must tell the reader at a glance what has happened and whether the story is worth reading. Magazines and other less frequently published sources often don't identify the subject so clearly, which makes them difficult to use in the aggregation system without deeper analysis.
In this system, the top six topics are listed in the header. Stories are gathered by topic and arranged in descending order of popularity. The system presents this information as a static HTML page that is regenerated every two hours (see Figure 5).
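A hedged sketch of how such a static front page might be assembled is shown below; the cluster data structure, field names, and HTML layout are illustrative guesses, not the site's actual code:

```python
# Assembling a Monkey News-style static front page (illustrative only).
import html

def render_front_page(clusters, out_path="index.html", top_n=6):
    """clusters: list of dicts such as
       {"topic": str, "stories": [{"title", "link", "source", "description"}, ...]}
    Popularity is taken to be the number of distinct sources in a cluster."""
    clusters = sorted(clusters,
                      key=lambda c: len({s["source"] for s in c["stories"]}),
                      reverse=True)
    parts = ["<html><body>"]
    # Header: the six most popular topics.
    parts.append("<p>" + " | ".join(html.escape(c["topic"]) for c in clusters[:top_n]) + "</p>")
    # Story groups, most popular first; the lead story carries the description.
    for c in clusters:
        if not c["stories"]:
            continue
        lead, *rest = c["stories"]
        parts.append("<h2>%s</h2>" % html.escape(lead["title"]))
        parts.append("<p>%s <em>(%s)</em></p>"
                     % (html.escape(lead["description"]), html.escape(lead["source"])))
        parts.append("<ul>")
        for s in rest:
            parts.append('<li><a href="%s">%s</a> (%s)</li>'
                         % (html.escape(s["link"]), html.escape(s["title"]),
                            html.escape(s["source"])))
        parts.append("</ul>")
    parts.append("</body></html>")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(parts))
    # A cron entry such as "0 */2 * * *  python rebuild.py" would regenerate
    # the page every two hours.
```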
Figure 5 Image capture of the front page of the Monkey News site, showing the top six subjects and the first two clusters of stories grouped by topic. Stories appear in descending order of popularity as determined by topic mention from the various sites.
An individual group is shown in Figure 6. The top of the grouping includes the headline, source identification, and a small paragraph describing the topic. Additional sources are listed underneath this initial story, allowing the reader to browse related coverage.
Figure 6 Screen capture of an individual story block from Monkey News. The grouping is clearly identifiable based on the topic, and the usability of a single group is evident. The top story in the group has the description of the story, with supplemental links and sources clearly identified.
Using this method, the number of news sources can be expanded easily without increasing the burden on the reader; instead, adding sources improves the accuracy of the groupings. A topic's popularity is inferred from the number of sources "talking" about it, with each source effectively casting a "vote." This information is inherent in the collection of stories and feeds and doesn't come from external sources.
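A small illustration of the voting idea, using the same assumed field names as the sketch above: a topic's popularity is simply the number of distinct sources carrying a story in its cluster.

```python
# Popularity as a source count per topic (field names are assumptions).
from collections import Counter

def topic_votes(clusters):
    """Map each topic to the number of distinct sources mentioning it."""
    return Counter({c["topic"]: len({s["source"] for s in c["stories"]})
                    for c in clusters})

# topic_votes(clusters).most_common(6) would give the six header topics.
```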
Monkey News has proven to be a useful site. During the 2004 presidential election, for example, the viewpoints of various news sites were constantly available. During the invasion of Iraq by U.S.-led forces, a constant stream of updates was available from sources both foreign and domestic. While this information could have been gathered using a traditional RSS aggregator, the number of stories would have been overwhelming. This approach clearly reduces clutter and improves readability over the basic, flat presentation of an RSS aggregator. More than 1,000 stories a day are gathered, grouped, and presented using this system.