Parsing HTML with Swing
- Using HTMLEditorKit.Parser
- Using HTMLEditorKit.ParserCallback
- Using HTML.Tag
- Using HTML.Attribute
- Conclusions
Java programs commonly need to process HTML. Although several third-party tools can do this, Java itself includes HTML processing as part of Swing. In this article, I will show you how to use the HTML processing capabilities that are built into Java.
Although Swing contains HTML processing capabilities, how to use them is not obvious. Swing needs HTML processing internally to display HTML text, but using it outside of Swing is more difficult. In the following sections, I will show you the classes that Swing makes available and how you can access them.
Using HTMLEditorKit.Parser
The Parser class, which is a nested class of the HTMLEditorKit class, is provided by Swing to facilitate the parsing of HTML. Obtaining an instance of this class, however, is not an easy task. It almost appears that the HTML parsing facilities of Swing were not meant to be used externally; their availability is more a side effect than a feature. This is particularly evident in the way you must obtain an HTMLEditorKit.Parser object.
The approach used in this article is to subclass HTMLEditorKit and override its getParser method to make it public. A class that does this is shown in Listing 1.
Listing 1: Gaining Access to the Swing HTML Parser
import javax.swing.text.html.*;

public class HTMLParse extends HTMLEditorKit {

    /**
     * Call to obtain a HTMLEditorKit.Parser object.
     *
     * @return A new HTMLEditorKit.Parser object.
     */
    public HTMLEditorKit.Parser getParser() {
        return super.getParser();
    }
}
Parser objects are obtained by calling the getParser method of HTMLEditorKit. Unfortunately, this method is declared protected rather than public. To call getParser from your own code, you override it as a public method in a subclass. This is exactly what the HTMLParse class is for. After you have obtained a Parser object, you call its parse method and pass it a callback object.
StringReader r = new StringReader( ...html string... );
HTMLEditorKit.Parser parse = new HTMLParse().getParser();
The above code assumes that you have just retrieved an HTML page as a string. The page is used to create a StringReader that will then be passed to the parse method. The variable callback is assumed to hold a valid callback object.
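parse.parse(r, callback, true);
The parser invokes the methods of this callback object as it encounters each tag, run of text, and comment in the HTML stream. The final argument, true, tells the parser to ignore any character-set declaration in the document. (The structure of a ParserCallback class is discussed in the next section.)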
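To see these pieces working together, here is a minimal sketch that combines the Listing 1 approach with a small callback. The names SwingParseSketch, TagCollector, and collectStartTags are hypothetical and chosen only for illustration, and the sample HTML string is made up. The callback simply records the name of each start tag it is handed.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical wrapper class, used only to keep the sketch in one file.
final class SwingParseSketch {

    // Same idea as Listing 1: widen getParser() from protected to public.
    static class HTMLParse extends HTMLEditorKit {
        @Override
        public HTMLEditorKit.Parser getParser() {
            return super.getParser();
        }
    }

    // Hypothetical callback that records the name of every start tag.
    static class TagCollector extends HTMLEditorKit.ParserCallback {
        final List<String> tags = new ArrayList<>();

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            tags.add(t.toString());
        }
    }

    // Parses the given HTML string and returns the start-tag names in the
    // order the parser reported them.
    static List<String> collectStartTags(String html) {
        Reader r = new StringReader(html);
        HTMLEditorKit.Parser parse = new HTMLParse().getParser();
        TagCollector callback = new TagCollector();
        try {
            // true = ignore any character-set declaration in the document
            parse.parse(r, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.tags;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(collectStartTags(html));
    }
}
```

Running main prints the start-tag names in document order. The Swing parser may also report tags that it infers from the document structure, so the list can contain more entries than the markup literally shows.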