The preceding chapters have shown most of what you'll need to know to use SAX2 effectively, but as individual techniques, in small bits and pieces. In this chapter, we'll look at more substantial examples, which tie those techniques together. The examples here should help you to understand the kinds of modules you'll need to put together similar SAX2-based applications. You'll also see some of the options you have for building larger processing tasks from SAX components.
One of the first popular XML document standards is hidden in the guts of web site management toolsets. It dates back to when XML wasn't fully crystallized. Back then, there was a lot of interest in using XML to address a widespread problem: how to tell users about updates to web sites so they didn't need to read the site several times a day. A "channel" based model was widely accepted, building on the broadcast publishers' analogy of a web site as a TV channel. Microsoft shipped an XML-like format called Channel Definition Format (CDF), and other update formats were also available, but the solution that caught on was from Netscape. It is called RSS. This originally stood for "RDF Site Summary,"[24] but it was simplified and renamed the "Rich Site Summary" format before it saw any wide adoption.
[24]RDF stands for Resource Description Framework. For more information, see http://www.w3.org/RDF/.
RSS 0.91 was the mechanism used to populate one of the earliest customizable web portals, My Netscape. The mechanism is simple: RSS presents a list of recently updated items from the web site, with summaries, as an XML file that could be fetched across the Web. Sites could update static summary files along with their content or generate them on the fly; site management tools could do either task automatically. It was easy for sites to create individualized views that aggregated the latest news from any of the numerous web sites providing RSS feeds.
There's essentially been a fork in the development of RSS. In recent surveys, about two thirds of the RSS sites use "RSS Classic," based on the 0.91 DTD and often with 0.92 extensions. (Mostly, the 0.92 spec removed limits from the non-DTD parts of the 0.91 spec.) Relatively recently, "New RSS" was created. Also called "RSS 1.0" (though not with the support of all the developers who had been enhancing RSS), this version is more complex. It uses RDF and XML namespaces and includes a framework with extension modules to address the complex content syndication and aggregation requirements of larger web sites. RSS toolkits tend to support both formats, but RDF itself is still not widely adopted. This is what part of one "RSS Classic" feed looks like, from the URL http://xmlhack.com/rss.php:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
    "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
  <channel>
    <title>xmlhack</title>
    <link>http://www.xmlhack.com</link>
    <description>Developer news from the XML community</description>
    <language>en-us</language>
    <managingEditor>[email protected]</managingEditor>
    <webMaster>[email protected]</webMaster>
    <item>
      <title>BEEP implementation for .NET/C#</title>
      <link>http://www.xmlhack.com/read.php?item=1470</link>
    </item>
    <item>
      <title>MinML-RPC, Sandstorm XML-RPC framework</title>
      <link>http://www.xmlhack.com/read.php?item=1469</link>
    </item>
    <item>
      <title>XSLT as query language</title>
      <link>http://www.xmlhack.com/read.php?item=1467</link>
    </item>
    <item>
      <title>Exclusive XML Canonicalization in Last Call</title>
      <link>http://www.xmlhack.com/read.php?item=1466</link>
    </item>
    <!--many items were deleted for this example-->
  </channel>
</rss>
In this section we use some of the techniques we've seen earlier and will look at both sides (client and server) of some simple RSS tools for RSS Classic. A full RSS toolset would need to handle New RSS, and would likely need an RDF engine to work with RDF metadata. Such RDF infrastructure should let applications work more with the semantics of the data, and would need RDF schema support. That's all much too complex to show here.[25]
[25] If you're interested in the RDF approach, look at sites like the Open Directory Project, at http://www.dmoz.org/, to see one way of using RDF.
First we'll build a simple custom data model, then write the code to marshal and unmarshal it, and finally see how those components fit into common types of RSS applications. In a microcosm, this is what lots of XML applications do: read XML into custom data structures, process them, and then write out more XML.
Here are the key parts of the RSS 0.91 DTD; it also incorporates the HTML 4.0 ISO Latin/1 character entities, which aren't shown here, and various other integrity rules that aren't expressed by this DTD:
<!ELEMENT rss (channel)>
<!ATTLIST rss version CDATA #REQUIRED> <!-- must be "0.91"> -->

<!ELEMENT channel (title | description | link | language | item+
    | rating? | image? | textinput? | copyright? | pubDate?
    | lastBuildDate? | docs? | managingEditor? | webMaster?
    | skipHours? | skipDays?)*>
<!ELEMENT image (title | url | link | width? | height? | description?)*>
<!ELEMENT item (title | link | description)*>
<!ELEMENT textinput (title | description | name | link)*>

<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT link (#PCDATA)>
<!ELEMENT url (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT rating (#PCDATA)>
<!ELEMENT language (#PCDATA)>
<!ELEMENT width (#PCDATA)>
<!ELEMENT height (#PCDATA)>
<!ELEMENT copyright (#PCDATA)>
<!ELEMENT pubDate (#PCDATA)>
<!ELEMENT lastBuildDate (#PCDATA)>
<!ELEMENT docs (#PCDATA)>
<!ELEMENT managingEditor (#PCDATA)>
<!ELEMENT webMaster (#PCDATA)>
<!ELEMENT hour (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT skipHours (hour+)>
<!ELEMENT skipDays (day+)>
In short, the DTD includes a wrapper that gives the version, one channel with some descriptive data, and a bunch of items. RSS 0.92 changes it slightly. Data length limits (which a DTD can't describe) are removed, and a bit more. If you're working with RSS, you should know that most RSS feeds incorporate at least a few of those 0.92 extensions and have your code handle the issues. And if you're generating an RSS feed for your web site, you'll want to know that many aggregators present the image as the channel's icon, along with the newest items and the text input box, to provide quick access to your site.
When you work with XML-based systems and SAX, one of the first things you'll want to do is decide on the data structures you'll use. Sometimes you'll have a pre-existing data structure that must be matched; in cases like this RSS code, you have the luxury of a blank slate to write on. I'm a big believer in designing appropriate data structures, rather than expecting some development tool to come up with a good answer; as a rule, a good "manual" design beats code generator output in any maintainable system. In the case of RSS Classic, simple structures like those shown in Example 6-1 can do the job:
import java.util.Vector;

public class RssChannel {
    // (optional, not part of RSS) URI for the RSS file
    public String sourceUri;

    // Five required items
    public String description = "";
    public Vector items = new Vector ();
    public String language = "";
    public String link = "";
    public String title = "";

    // Lots of optional items
    public String copyright = "";
    public String docs = "";
    public RssImage image;
    public String lastBuildDate = "";
    public String managingEditor = "";
    public String pubDate = "";
    public String rating = "";
    // public Days skipDays;
    // public Hours skipHours;
    public RssTextInput textinput;
    public String webMaster = "";

    // channels have a bunch of items
    static public class RssItem {
        public String description = "";
        public String link = "";
        public String title = "";
    }

    // Text input is used to query the channel
    static public class RssTextInput {
        public String description = "";
        public String link = "";
        public String name = "";
        public String title = "";
    }

    // Image used for the channel
    static public class RssImage {
        public String link = "";
        public String title = "";
        public String url = "";

        // optional
        public String description = "";
        public String height = "";
        public String width = "";
    }
}
Note that these classes don't include any methods; methods can be added later, as application code determines what's really necessary. For example, you may prefer to use beans-style accessor functions, but they would only complicate this example. (So would the class and field documentation, which has been deleted for simplicity.) Methods are also the natural place to enforce constraints: there are a variety of features that would be good to constrain this way, which you'll see if you look at the RSS specifications. Even pure "value objects" benefit from such internal consistency checks.
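As one illustration of the kind of consistency check mentioned above, here is a minimal sketch that is not part of the original examples. The ChannelCheck class and its problems method are hypothetical names, and the nested Channel class is a cut-down stand-in for RssChannel so that the sketch compiles on its own. It only checks the five items that Example 6-1 marks as required.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a consistency check for the required RSS 0.91 fields.
class ChannelCheck {
    // Cut-down stand-in for RssChannel, holding only the required items.
    static class Channel {
        String title = "";
        String link = "";
        String description = "";
        String language = "";
        List<String> itemTitles = new ArrayList<String>();
    }

    // Returns a list of problems; an empty list means every check passed.
    static List<String> problems(Channel c) {
        List<String> found = new ArrayList<String>();
        if (c.title.length() == 0) found.add("missing title");
        if (c.link.length() == 0) found.add("missing link");
        if (c.description.length() == 0) found.add("missing description");
        if (c.language.length() == 0) found.add("missing language");
        if (c.itemTitles.isEmpty()) found.add("no items");
        return found;
    }

    public static void main(String[] args) {
        Channel c = new Channel();
        c.title = "xmlhack";
        c.link = "http://www.xmlhack.com";
        // Reports the three required items that are still missing.
        System.out.println(problems(c));
    }
}
```

Returning a list of problems, rather than throwing on the first one, is just one possible design; it suits feeds from the Web, where you often want to report everything that's wrong and still use what you can.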
There's one type of code that is certainly needed but was intentionally put into different classes: marshaling data to RSS and unmarshaling it from RSS. Such choices are design policies; while it's good to keep marshaling code in one place, that place doesn't need to be the data structure class itself. It's good to separate marshaling code and data structure code because it's easier to support several different kinds of input and output syntax. Examples include different versions of RSS, as well as transfers to and from databases with JDBC. To display RSS in a web browser, different versions of HTML may be appropriate. Sometimes, embedding a stylesheet processing instruction into the XML text may be the way to go. Separate marshaling code needs attention when data structures change, but good software maintenance procedures will ensure that's never a problem.
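To make the separation concrete, here is a minimal sketch of a second marshaler that writes item data as an XHTML list instead of RSS. It is an illustration under stated assumptions, not code from the examples in this chapter: HtmlItemWriter, Item, escape, and toXhtml are all hypothetical names, and Item is a stand-in for RssChannel.RssItem.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: marshaling code kept apart from the data structure.
class HtmlItemWriter {
    // Stand-in for RssChannel.RssItem.
    static class Item {
        final String title;
        final String link;
        Item(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    // Escape the characters that are special in XHTML text and attribute values.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Render the items as an XHTML unordered list of links.
    static String toXhtml(List<Item> items) {
        StringBuilder out = new StringBuilder("<ul>\n");
        for (Item item : items) {
            out.append("  <li><a href=\"").append(escape(item.link)).append("\">")
               .append(escape(item.title)).append("</a></li>\n");
        }
        return out.append("</ul>\n").toString();
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<Item>();
        items.add(new Item("XSLT as query language",
                "http://www.xmlhack.com/read.php?item=1467"));
        System.out.print(toXhtml(items));
    }
}
```

Because the data class knows nothing about this writer, adding it doesn't disturb any RSS marshaling code that already works.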
Earlier chapters have touched on ways to marshal and unmarshal data with SAX. This section shows these techniques more completely, for a real-world application data model.
Example 6-2 shows what SAX-based unmarshaling code can look like, without the parser hookup. In this case it's set up to be the endpoint on a pipeline. This just turns infoset "atoms" into RSS "molecules" and stops. Note that it isn't particularly thorough in how it handles all the various types of illegal, or just unexpected, RSS that's found on the Web, although it handles many RSS Classic sites perfectly well. For example, the controls to skip fetches on particular days (perhaps weekends) or hours (nonbusiness hours) aren't usually supported, so they're just ignored here. With a more complex DTD, unmarshaling might not be able to rely on such a simple element stacking scheme; you might need to stack the objects you're unmarshaling and use a more complex notion of context to determine the appropriate actions to take.
import java.util.Stack;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Nested types are named through RssChannel, since current compilers
// reject imports from the unnamed package.
public class RssConsumer extends DefaultHandler {
    private RssChannel channel;
    private RssChannel.RssItem item;
    private RssChannel.RssImage image;
    private RssChannel.RssTextInput input;
    private Stack stack;
    private Locator locator;

    public RssChannel getChannel ()
        { return channel; }

    private String getCurrentElementName ()
        { return (String) stack.peek (); }

    // only need a handful of ContentHandler methods
    public void setDocumentLocator (Locator l)
        { locator = l; }

    public void startDocument () throws SAXException {
        channel = new RssChannel ();
        if (locator != null)
            channel.sourceUri = locator.getSystemId ();
        stack = new Stack ();
    }

    public void startElement (
        String namespace,
        String local,
        String name,
        Attributes attrs
    ) throws SAXException {
        stack.push (name);
        if ("item".equals (name))
            item = new RssChannel.RssItem ();
        else if ("image".equals (name))
            image = new RssChannel.RssImage ();
        else if ("textinput".equals (name))
            input = new RssChannel.RssTextInput ();
        // parser misconfigured?
        else if (name.length () == 0)
            throw new SAXParseException ("XML names not available", locator);
    }

    public void characters (char buf [], int off, int len)
    throws SAXException {
        String top = getCurrentElementName ();
        String value = new String (buf, off, len);

        if ("title".equals (top)) {
            if (item != null) item.title += value;
            else if (image != null) image.title += value;
            else if (input != null) input.title += value;
            else channel.title += value;
        } else if ("description".equals (top)) {
            if (item != null) item.description += value;
            else if (image != null) image.description += value;
            else if (input != null) input.description += value;
            else channel.description += value;
        } else if ("link".equals (top)) {
            if (item != null) item.link += value;
            else if (image != null) image.link += value;
            else if (input != null) input.link += value;
            else channel.link += value;
        } else if ("url".equals (top)) {
            image.url += value;
        } else if ("name".equals (top)) {
            input.name += value;
        } else if ("language".equals (top)) {
            channel.language += value;
        } else if ("managingEditor".equals (top)) {
            channel.managingEditor += value;
        } else if ("webMaster".equals (top)) {
            channel.webMaster += value;
        } else if ("copyright".equals (top)) {
            channel.copyright += value;
        } else if ("lastBuildDate".equals (top)) {
            channel.lastBuildDate += value;
        } else if ("pubDate".equals (top)) {
            channel.pubDate += value;
        } else if ("docs".equals (top)) {
            channel.docs += value;
        } else if ("rating".equals (top)) {
            channel.rating += value;
        } // else ignore ... skipDays and so on.
    }

    public void endElement (
        String namespace,
        String local,
        String name
    ) throws SAXException {
        // leave this element, so later text is credited to its parent
        stack.pop ();

        if ("item".equals (name)) {
            // patch item.link
            channel.items.addElement (item);
            item = null;
        } else if ("image".equals (name)) {
            // patch image.link
            // (patch image.url)
            channel.image = image;
            image = null;
        } else if ("textinput".equals (name)) {
            // patch input.link
            channel.textinput = input;
            input = null;
        } else if ("channel".equals (name)) {
            // patch channel.link
        }
    }
}
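For DTDs where the current element name alone isn't enough context, one option is to consult the parent element on the stack as well. The following is a minimal, self-contained sketch of that idea, not a replacement for Example 6-2. TitleCollector and its members are hypothetical names, and the sketch uses the JDK's bundled JAXP parser, which by default is not namespace-aware and so reports element names as qNames.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: use the parent element, not just the current one, as context.
class TitleCollector extends DefaultHandler {
    final List<String> itemTitles = new ArrayList<String>();
    String channelTitle = "";
    private final Deque<String> stack = new ArrayDeque<String>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        stack.push(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] buf, int off, int len) {
        // Text may arrive in several chunks, so accumulate it.
        text.append(buf, off, len);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        stack.pop();
        String parent = stack.peek(); // null once the root element has closed
        if ("title".equals(qName)) {
            if ("item".equals(parent)) itemTitles.add(text.toString());
            else if ("channel".equals(parent)) channelTitle = text.toString();
            // titles of image and textinput are ignored in this sketch
        }
        text.setLength(0);
    }

    static TitleCollector parse(String xml) throws Exception {
        TitleCollector handler = new TitleCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        TitleCollector t = parse("<rss version='0.91'><channel><title>xmlhack</title>"
            + "<image><title>logo</title></image>"
            + "<item><title>XSLT as query language</title></item></channel></rss>");
        System.out.println(t.channelTitle + ": " + t.itemTitles);
    }
}
```

Deciding what to do in endElement, once all the text for an element has been collected, also avoids building strings piecemeal with += in the characters callback.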
If you think in terms of higher-level parsing events, rather than in terms of data structures, you might want to define an application-level event handler interface and package your code as an XMLFilterImpl, as shown in Example 6-3. This is the "atoms into molecules" pattern for handlers, as sketched in Chapter 3, "Producing SAX2 Events". In the case of RSS, both item and channel might reasonably be expected to be "molecules" that get reported individually as application-level events. If you report finer grained structures (like item) it might be easier to assemble higher-level data structures, but we won't show that here.
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.helpers.XMLFilterImpl;

public interface RssHandler {
    void channelUpdate (RssChannel c) throws SAXException;
}

public class RssConsumer extends XMLFilterImpl {
    // ... as above (notice different base class!) but also:
    private RssHandler handler;

    public static String RssHandlerURI =
        "http://www.example.com/properties/rss-handler";

    public void setProperty (String uri, Object value)
    throws SAXNotSupportedException, SAXNotRecognizedException {
        if (RssHandlerURI.equals (uri)) {
            if (value instanceof RssHandler) {
                handler = (RssHandler) value;
                return;
            }
            throw new SAXNotSupportedException ("not an RssHandler");
        }
        super.setProperty (uri, value);
    }

    public Object getProperty (String uri)
    throws SAXNotSupportedException, SAXNotRecognizedException {
        if (RssHandlerURI.equals (uri))
            return handler;
        return super.getProperty (uri);
    }

    public void endDocument () throws SAXException {
        if (handler == null)
            return;
        handler.channelUpdate (getChannel ());
    }
}
A filter written in that particular way can be used almost interchangeably with the handler-only class shown earlier in Example 6-2. In fact it's just a bit more flexible than that, though it may not be a good pipeline-style component. That's because it doesn't pass the low-level events through consistently; the ContentHandler methods this implements don't pass their events through to the superclass, but all the other methods do. That's easily fixed, but it's likely that you'd either want all the XML atoms to be visible (extending the XML Infoset with RSS-specific data abstractions) or none of them (and use an RSS-only infoset).
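If you do want all the XML atoms to stay visible, the fix is to call the superclass method from each callback you override. Here is a minimal sketch of that pattern using a deliberately tiny filter; ItemCountingFilter and its count method are hypothetical names, and the sketch assumes the JDK's bundled JAXP parser rather than the RssConsumer class.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch: a filter that observes events and still passes them on.
class ItemCountingFilter extends XMLFilterImpl {
    int items;

    ItemCountingFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if ("item".equals(qName)) items++;
        // Delegating to the superclass forwards the event to the next handler.
        super.startElement(uri, local, qName, atts);
    }

    // Returns { items counted by the filter, elements seen downstream }.
    static int[] count(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        ItemCountingFilter filter = new ItemCountingFilter(parser);
        final int[] downstream = { 0 };
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                downstream[0]++;
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return new int[] { filter.items, downstream[0] };
    }

    public static void main(String[] args) throws Exception {
        int[] n = count("<rss><channel><item/><item/></channel></rss>");
        System.out.println(n[0] + " items, " + n[1] + " elements seen downstream");
    }
}
```

The downstream handler sees every element, including the ones the filter counted, which is what makes such a filter usable in the middle of a pipeline.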
Example 6-4 shows what the core marshaling code can look like, without the hookup to an XMLWriter or the XMLWriter setup. For simplicity, this example takes a few shortcuts: it doesn't marshal the channel's icon description or most of the other optional fields. But notice that it does take care to write out the DTD and provide some whitespace to indent the text. (It uses only newlines for end-of-line; output code is responsible for mapping those to CRLF or CR when needed.) Also, notice that it just generates SAX2 events; this data could be fed to an XMLWriter, or to the RssConsumer class, or to any other SAX-processing component.
import java.util.Enumeration;
import org.xml.sax.*;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.AttributesImpl;

// RssItem is named through RssChannel, since current compilers
// reject imports from the unnamed package.
public class RssProducer implements RssHandler {
    private static char lineEnd [] = { '\n', '\t', '\t', '\t' };

    private ContentHandler content;
    private LexicalHandler lexical;

    public RssProducer (ContentHandler n)
        { content = n; }

    public void setLexicalHandler (LexicalHandler l)
        { lexical = l; }

    private void doIndent (int n) throws SAXException {
        n++; // NL
        if (n > lineEnd.length)
            n = lineEnd.length;
        content.ignorableWhitespace (lineEnd, 0, n);
    }

    private void element (int indent, String name, String val, Attributes atts)
    throws SAXException {
        char contents [] = val.toCharArray ();

        doIndent (indent);
        content.startElement ("", "", name, atts);
        content.characters (contents, 0, contents.length);
        content.endElement ("", "", name);
    }

    public void channelUpdate (RssChannel channel) throws SAXException {
        AttributesImpl atts = new AttributesImpl ();

        content.startDocument ();
        if (lexical != null) {
            lexical.startDTD ("rss",
                "-//Netscape Communications//DTD RSS 0.91//EN",
                "http://my.netscape.com/publish/formats/rss-0.91.dtd");
            lexical.endDTD ();
        }

        atts.addAttribute ("", "", "version", "CDATA", "0.91");
        content.startElement ("", "", "rss", atts);
        atts.clear ();
        doIndent (0);
        content.startElement ("", "", "channel", atts);

        // describe the channel
        // four required elements
        element (1, "title", channel.title, atts);
        element (1, "link", channel.link, atts);
        element (1, "description", channel.description, atts);
        element (1, "language", channel.language, atts);

        // optional elements (compare string contents, not references)
        if (channel.managingEditor.length () != 0)
            element (1, "managingEditor", channel.managingEditor, atts);
        if (channel.webMaster.length () != 0)
            element (1, "webMaster", channel.webMaster, atts);
        // ... and many others, notably image/icon and text input

        // channel contents: at least one item
        for (Enumeration e = channel.items.elements ();
                e.hasMoreElements ();
                /**/) {
            RssChannel.RssItem item = (RssChannel.RssItem) e.nextElement ();

            doIndent (1);
            content.startElement ("", "", "item", atts);
            if (item.title.length () != 0)
                element (2, "title", item.title, atts);
            if (item.link.length () != 0)
                element (2, "link", item.link, atts);
            if (item.description.length () != 0)
                element (2, "description", item.description, atts);
            doIndent (1);
            content.endElement ("", "", "item");
        }

        content.endElement ("", "", "channel");
        content.endElement ("", "", "rss");
        content.endDocument ();
    }
}
Since this code implements the RssHandler interface shown earlier, an instance of this class could be assigned as the RSS handler for the XMLFilter shown here. That could be useful if you wanted to round-trip RSS data. Round-tripping data can be a good way to test marshaling and unmarshaling code. You can create collections of input documents, and automatically unmarshal or remarshal their data. If you compare inputs and outputs, you can ensure that you haven't discarded any important information or added inappropriate text.
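A round-trip test can be sketched in a few dozen lines. The version below is a simplified illustration rather than a test of RssProducer and RssConsumer themselves: RoundTrip and its methods are hypothetical names, the data model is just a list of item titles, and a JAXP TransformerHandler from the JDK stands in for the XMLWriter as the component that turns SAX events back into text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a round-trip test for marshaling code.
class RoundTrip {
    // Marshal: turn a list of item titles into SAX events.
    static void marshal(List<String> titles, ContentHandler out) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        out.startDocument();
        out.startElement("", "channel", "channel", none);
        for (String t : titles) {
            out.startElement("", "item", "item", none);
            out.characters(t.toCharArray(), 0, t.length());
            out.endElement("", "item", "item");
        }
        out.endElement("", "channel", "channel");
        out.endDocument();
    }

    // Unmarshal: rebuild the list of titles from XML text.
    static List<String> unmarshal(String xml) throws Exception {
        final List<String> titles = new ArrayList<String>();
        final StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes a) {
                    text.setLength(0);
                }
                @Override
                public void characters(char[] buf, int off, int len) {
                    text.append(buf, off, len);
                }
                @Override
                public void endElement(String u, String l, String q) {
                    if ("item".equals(q)) titles.add(text.toString());
                }
            });
        return titles;
    }

    // Serialize the SAX events to text, then parse that text back.
    static List<String> roundTrip(List<String> titles) throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler writer = factory.newTransformerHandler();
        StringWriter buffer = new StringWriter();
        writer.setResult(new StreamResult(buffer));
        marshal(titles, writer);
        return unmarshal(buffer.toString());
    }

    public static void main(String[] args) throws Exception {
        List<String> in = Arrays.asList("XSLT as query language", "Q & A <live>");
        System.out.println(in.equals(roundTrip(in)) ? "round trip OK" : "MISMATCH");
    }
}
```

Including titles with markup characters such as & and < in the test data is worthwhile, since escaping mistakes are among the easiest marshaling bugs to introduce.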
One of the most fundamental things you can do in an RSS application is act as a client: fetch a site's summary data and present it in some useful format. Often, your personal view of a web site is decorated with pages or sidebars that summarize the latest news as provided by other sites; they fetch RSS data, cache it, and reformat it as HTML or XHTML so your web browser shows it. That is, the web server acts as a client to RSS feeds and generates individualized pages on which you can read and click on the latest headlines.
Example 6-5 is a simple client that dumps its output as text. It's simple to write a servlet or JSP that does this for a set of RSS feeds, formatting them as nice XHTML sidebar tables so that a site's pages will be more useful.[26]
[26]If you do this in a server, you should handle one very important task that's not shown here: cache the RSS data! Do not make servers fetch the summary before each page view. That makes for a very slow user experience and can overload remote RSS feeds.
There are two basic techniques to use to create such a cache. One is to put a caching proxy between your server and all the RSS feeds. The other is to write a page cache module, preferably one that uses HTTP "conditional GET" (the If-Modified-Since HTTP header field) to avoid excess cache updates. You can save RssChannel data or store channel information in a local database, as variants of the page cache technique.
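The page cache technique with "conditional GET" can be sketched as follows. This is a minimal illustration under stated assumptions, not production code: FeedCache, Reply, and Transport are hypothetical names, the cache holds a single feed in memory, and the Transport interface exists only so the cache logic can be exercised without a network. The HTTP transport uses the standard HttpURLConnection calls setIfModifiedSince, getResponseCode, and getLastModified.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a one-entry page cache that revalidates with If-Modified-Since.
class FeedCache {
    // Result of one fetch; body is null when the server answered 304 Not Modified.
    static class Reply {
        final byte[] body;
        final long lastModified;
        Reply(byte[] body, long lastModified) {
            this.body = body;
            this.lastModified = lastModified;
        }
    }

    // Assumed seam between cache logic and the network.
    interface Transport {
        Reply get(String url, long ifModifiedSince) throws IOException;
    }

    private final Transport transport;
    private byte[] cached;
    private long lastModified; // 0 means nothing is cached yet

    FeedCache(Transport transport) {
        this.transport = transport;
    }

    // Return the feed bytes, replacing the cached copy only when the feed changed.
    byte[] fetch(String url) throws IOException {
        Reply reply = transport.get(url, lastModified);
        if (reply.body != null) {
            cached = reply.body;
            lastModified = reply.lastModified;
        }
        return cached;
    }

    // Transport that issues a real conditional GET.
    static final Transport HTTP = new Transport() {
        public Reply get(String url, long since) throws IOException {
            HttpURLConnection c =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            if (since > 0) c.setIfModifiedSince(since);
            if (c.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED)
                return new Reply(null, since);
            InputStream in = c.getInputStream();
            try {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
                return new Reply(out.toByteArray(), c.getLastModified());
            } finally {
                in.close();
            }
        }
    };

    public static void main(String[] args) throws IOException {
        // In-memory transport for demonstration: the feed "changes" only once.
        FeedCache cache = new FeedCache((url, since) -> since == 0
            ? new Reply("<rss/>".getBytes(StandardCharsets.UTF_8), 1000L)
            : new Reply(null, since));
        cache.fetch("http://example.com/rss");
        byte[] again = cache.fetch("http://example.com/rss");
        System.out.println(new String(again, StandardCharsets.UTF_8));
    }
}
```

A server would construct the cache with FeedCache.HTTP and call fetch on a timer rather than on each page view; storing parsed RssChannel data instead of raw bytes is a straightforward variation.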
One extremely important point shown here is that this code uses a resolver to force the use of a local copy of the RSS DTD. Servers should always use local copies of DTDs. Some RSS applications got a rude reminder of that fact in April 2001, when Netscape accidentally removed the DTD when it reorganized its web site. Suddenly, those badly written applications stopped working on many RSS feeds! Of course, those that were properly set up with local copies of that DTD had no problems at all.
import gnu.xml.util.Resolver;
import java.util.Hashtable;
import org.xml.sax.*;
import org.xml.sax.helpers.XMLReaderFactory;

// RssItem is named through RssChannel, since current compilers
// reject imports from the unnamed package.
public class RssMain {
    private static String featurePrefix = "http://xml.org/sax/features/";

    // Invoke with one argument, a URI or filename
    public static void main (String argv []) {
        if (argv.length != 1) {
            System.err.println ("Usage: RssMain [file|URL]");
            System.exit (1);
        }

        try {
            XMLReader reader;
            RssConsumer consumer;
            Hashtable hashtable;
            Resolver resolver;

            reader = XMLReaderFactory.createXMLReader ();
            consumer = new RssConsumer ();
            reader.setContentHandler (consumer);

            // handle the "official" DTD server being offline
            hashtable = new Hashtable (5);
            hashtable.put (
                "-//Netscape Communications//DTD RSS 0.91//EN",
                Resolver.fileNameToURL ("rss-0_91.dtd"));
            resolver = new Resolver (hashtable);
            reader.setEntityResolver (resolver);

            // we rely on qNames, and 0.91 doesn't use namespaces
            reader.setFeature (featurePrefix + "namespace-prefixes", true);
            reader.setFeature (featurePrefix + "namespaces", false);

            argv [0] = Resolver.getURL (argv [0]);
            reader.parse (argv [0]);

            RssChannel channel = consumer.getChannel ();

            System.out.println ("Partial RSS 0.91 channel info");
            System.out.println ("SOURCE = " + channel.sourceUri);
            System.out.println ();
            System.out.println ("         Title: " + channel.title);
            System.out.println ("   Description: " + channel.description);
            System.out.println ("          Link: " + channel.link);
            System.out.println ("      Language: " + channel.language);
            System.out.println ("     WebMaster: " + channel.webMaster);
            System.out.println ("ManagingEditor: " + channel.managingEditor);
            System.out.println ();
            System.out.println ("    Item Count: " + channel.items.size ());

            for (int i = 0; i < channel.items.size (); i++) {
                RssChannel.RssItem item =
                    (RssChannel.RssItem) channel.items.elementAt (i);

                System.out.println ("ITEM # " + i);
                if (item != null) {
                    System.out.println ("         Title: " + item.title);
                    System.out.println ("   Description: " + item.description);
                    System.out.println ("          Link: " + item.link);
                }
            }

        // Good error handling is not shown here, for simplicity
        } catch (Exception e) {
            System.err.println ("Whoa: " + e.getMessage ());
            System.exit (1);
        }
        System.exit (0);
    }
}
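If you aren't using the gnu.xml.util.Resolver class, the same effect can be had with the standard org.xml.sax.EntityResolver interface. The following minimal sketch maps the RSS 0.91 public identifier to a local copy of the DTD. LocalDtdDemo and its members are hypothetical names, and the "local copy" here is a short in-memory fragment so the sketch is self-contained; a real server would return an InputSource for its own rss-0_91.dtd file instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: resolve the RSS 0.91 public ID to a local DTD copy.
class LocalDtdDemo {
    static final String RSS_PUBLIC_ID = "-//Netscape Communications//DTD RSS 0.91//EN";

    // Stand-in for a local DTD file; only a fragment of the real DTD.
    static final String LOCAL_DTD =
        "<!ELEMENT rss (channel)>\n"
        + "<!ATTLIST rss version CDATA #REQUIRED>\n"
        + "<!ELEMENT channel (title)*>\n"
        + "<!ELEMENT title (#PCDATA)>\n"
        + "<!ENTITY eacute \"&#233;\">\n";

    static final String SAMPLE =
        "<!DOCTYPE rss PUBLIC \"" + RSS_PUBLIC_ID + "\" "
        + "\"http://my.netscape.com/publish/formats/rss-0.91.dtd\">"
        + "<rss version=\"0.91\"><channel><title>caf&eacute;</title></channel></rss>";

    // Parse the document and return the channel title, never fetching the remote DTD.
    static String channelTitle(String xml) throws Exception {
        final StringBuilder title = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(new EntityResolver() {
            public InputSource resolveEntity(String publicId, String systemId) {
                if (RSS_PUBLIC_ID.equals(publicId))
                    return new InputSource(new StringReader(LOCAL_DTD));
                return null; // anything else gets the parser's default handling
            }
        });
        reader.setContentHandler(new DefaultHandler() {
            private boolean inTitle;
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                inTitle = "title".equals(q);
            }
            @Override
            public void characters(char[] buf, int off, int len) {
                if (inTitle) title.append(buf, off, len);
            }
            @Override
            public void endElement(String u, String l, String q) {
                inTitle = false;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return title.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(channelTitle(SAMPLE));
    }
}
```

The entity reference in the sample title can only be expanded if the DTD is read, so a successful parse shows that the local copy was used.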
Besides servlets that present RSS data in HTML form to a web site's clients, another kind of servlet is important in the world of RSS applications: servlets that deliver a site's own RSS feed as XML. Servers often arrange that the current channel data is always ready to serve at a moment's notice. You've probably worked with sites that give you HTML forms to publish either short articles (web log entries or discussion follow-ups) or long ones (perhaps XML DocBook source that's then formatted). When such forms post data through a servlet, it's easy to ensure the servlet updates the site's RSS channel data when it updates other site data for those articles.
While the mechanics of such a servlet would be specific to the procedures used at a given web site, almost any site could use code like that in Example 6-6 to actually deliver the RSS feed. Notice that the XML text is delivered with an encoding that any XML parser is guaranteed to handle, using CRLF-style line ends (part of the MIME standard for text/* content types). The servlet also sets the Last-Modified HTTP timestamp, so it supports HTTP caches based on either "conditional GET" or on explicit timestamp checks with the HEAD request.
import gnu.xml.util.XMLWriter;
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.*;
import org.xml.sax.SAXException;

// a "Globals" class is used here to access channel and related data
public class RssGenServlet extends HttpServlet {
    public void doGet (HttpServletRequest request, HttpServletResponse response)
    throws IOException, ServletException {
        RssProducer producer;
        XMLWriter consumer;

        response.addDateHeader ("Last-Modified", Globals.channelModified);
        response.setContentType ("text/xml;charset=UTF-8");

        consumer = new XMLWriter (response.getWriter ());
        consumer.setEOL ("\r\n");

        try {
            producer = new RssProducer (consumer);
            producer.setLexicalHandler (consumer);
            producer.channelUpdate (Globals.channel);
        } catch (SAXException e) {
            throw new ServletException (e);
        }
    }
}
As RSS 1.0 starts to become more widely supported and more RSS/RDF modules are defined, more clever RSS-based services will become available. For example, RSS aggregator services may begin to be able to dynamically create new channels with information filtered from many other channels. That is, you could be able to define a channel that the aggregator will fill with new articles on a particular topic, listed in any of several hundred RSS feeds. Today, you'd have to scan each feed yourself to do that. Such smarter services would also have better reasons to cache information. Today, such a service would have a hard time knowing which articles to remember while you were away on vacation, since there would be far too many articles to remember them all.
Copyright © 2002 O'Reilly & Associates. All rights reserved.