Gotcha! (Java & XML, 2nd Edition)

Before leaving this introduction to parsing XML documents with SAX, there are a few pitfalls you should be aware of. Knowing these "gotchas" will help you avoid common programming mistakes when using SAX; I will discuss similar pitfalls for other APIs in the appropriate sections.

3.5.1. My Parser Doesn't Support SAX 2.0

For those of you who are forced to use a SAX 1.0 parser, perhaps in an existing application, don't despair. First, you always have the option of changing parsers; keeping current on SAX standards is an important part of an XML parser's responsibility, and if your vendor is not doing this, you may have other concerns to address with them as well. However, there are certainly cases where you are forced to use a parser because of legacy code or applications; in these situations, you are still not left out in the cold.

SAX 2.0 includes a helper class, org.xml.sax.helpers.ParserAdapter, which can actually cause a SAX 1.0 Parser implementation to behave like a SAX 2.0 XMLReader implementation. This handy class takes in a 1.0 Parser implementation as an argument and then can be used instead of that implementation. It allows a ContentHandler to be set (which is a SAX 2.0 construct), and handles all namespace callbacks properly (also a feature of SAX 2.0). The only functionality loss you will see is that skipped entities will not be reported, as this capability was not available in a 1.0 implementation in any form, and cannot be emulated by a 2.0 adapter class. Example 3-3 shows this behavior in action.

Example 3-3. Using SAX 1.0 with SAX 2.0 code constructs

try {
    // Register a parser with SAX
    Parser parser = 
        ParserFactory.makeParser(
            "org.apache.xerces.parsers.SAXParser");
            
    ParserAdapter myParser = new ParserAdapter(parser);
                                        
    // Register the document handler
    myParser.setContentHandler(contentHandler);
    
    // Register the error handler
    myParser.setErrorHandler(errHandler);            
        
    // Parse the document      
    myParser.parse(uri);
    
} catch (ClassNotFoundException e) {
    System.out.println(
        "The parser class could not be found.");
} catch (IllegalAccessException e) {
    System.out.println(
        "Insufficient privileges to load the parser class.");
} catch (InstantiationException e) {
    System.out.println(
        "The parser class could not be instantiated.");
} catch (ClassCastException e) {
    System.out.println(
        "The parser does not implement org.xml.sax.Parser");
} catch (IOException e) {
    System.out.println("Error reading URI: " + e.getMessage( ));
} catch (SAXException e) {
    System.out.println("Error in parsing: " + e.getMessage( ));
}

If SAX is new to you and this example doesn't make much sense, don't worry about it; you are using the latest and greatest version of SAX (2.0) and probably won't ever have to write code like this. This code is helpful only in cases where a 1.0 parser must be used.

3.5.2. The SAX XMLReader: Reused and Reentrant

One of Java's nicest features is the easy reuse of objects, and the memory advantages of this reuse. SAX parsers are no different. Once an XMLReader has been instantiated, you can keep using that reader to parse several or even hundreds of XML documents. Different documents or InputSources may be passed to a reader one after another, allowing it to be used for a variety of tasks. However, readers are not reentrant: once the parsing process has started, a reader may not be used for another parse until the parsing of the requested document or input has completed. In other words, the process cannot be reentered. For those prone to coding recursive methods, this is definitely a gotcha! The first time that you attempt to use a reader that is in the middle of processing another document, you will receive a rather nasty SAXException and all parsing will stop. What is the lesson learned? Parse one document at a time, or pay the price of instantiating multiple reader instances.
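As a rough illustration of sequential reuse, the sketch below feeds several documents to one XMLReader, one after another. The class name ReaderReuseDemo, the ElementCounter handler, and the sample documents are hypothetical and are not part of this chapter's example code; the reader is obtained through the JAXP SAXParserFactory only so that the sketch runs without naming a specific vendor class.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names and sample documents are illustrative only.
class ReaderReuseDemo {

    // Counts start tags so that each parse produces something observable.
    static class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startDocument() {
            count = 0; // reset at the beginning of every parse
        }

        @Override
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts) {
            count++;
        }
    }

    // Parses each document in turn with a single XMLReader instance
    // and returns the number of elements found in each one.
    static int[] countElements(String... documents) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        ElementCounter counter = new ElementCounter();
        reader.setContentHandler(counter);

        int[] counts = new int[documents.length];
        for (int i = 0; i < documents.length; i++) {
            // parse( ) returns only when the document is finished, so the
            // reader is free again before the next iteration starts.
            reader.parse(new InputSource(new StringReader(documents[i])));
            counts[i] = counter.count;
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        int[] counts = countElements("<a><b/><c/></a>", "<x/>");
        System.out.println(counts[0] + " elements, then " + counts[1]);
    }
}
```

The loop never calls parse( ) while an earlier parse is still running, which is the condition the reader requires. Code that needs to parse a second document from inside a callback would have to create a second reader for that purpose.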

3.5.3. The Misplaced Locator

Another dangerous but seemingly innocuous feature of SAX events is the Locator instance that is made available through the setDocumentLocator( ) callback method. This gives the application the origin of a SAX event, and is useful for making decisions about the progress of parsing and reaction to events. However, this origin point is valid only while the ContentHandler is receiving callbacks for the current parse; once parsing is complete, the Locator is no longer valid, including the case when another parse begins. A mistake that many XML newcomers make is to hand the Locator to another class, which holds the reference in a member variable and uses it outside of the callback methods:

public void setDocumentLocator(Locator locator) {
    // Saving the Locator to a class outside the ContentHandler
    myOtherClass.setLocator(locator);
}
...

public void myOtherClassMethod( ) {
    // Trying to use this outside of the ContentHandler
    System.out.println(locator.getLineNumber( ));
}

This is an extremely bad idea, as this Locator instance becomes meaningless as soon as the scope of the ContentHandler implementation is left. Often, using the member variable resulting from this operation results in not only erroneous information being supplied to an application, but exceptions being generated in the running code. In other words, use this object locally, within the handler's callbacks, and not globally. By contrast, the JTreeContentHandler implementation class saves the supplied Locator instance to a member variable of the handler itself and reads it only inside callback methods. There it can correctly be used (for example) to give you the line number of each element as it is encountered:

public void startElement(String namespaceURI, String localName,
                         String rawName, Attributes atts)
    throws SAXException {
    
    DefaultMutableTreeNode element =
        new DefaultMutableTreeNode("Element: " + localName +
            " at line " + locator.getLineNumber());
    current.add(element);
    // Rest of existing code...
}
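If position information is needed after parsing ends, one option is to copy the values out of the Locator during a callback instead of passing the Locator itself around. The following is a minimal sketch of that idea, not code from JTreeContentHandler; the LocatorDemo and LineRecordingHandler names are hypothetical. It also checks for a null locator, on the assumption that a parser may not supply one at all.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names are illustrative only.
class LocatorDemo {

    static class LineRecordingHandler extends DefaultHandler {
        // The Locator stays private to the handler.
        private Locator locator;

        // Only plain values (element name and line number) are kept.
        final List<String> seen = new ArrayList<>();

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startElement(String namespaceURI, String localName,
                                 String rawName, Attributes atts) {
            // Read the position now, while the callback is running.
            int line = (locator != null) ? locator.getLineNumber() : -1;
            seen.add(localName + "@" + line);
        }

        @Override
        public void endDocument() {
            // The Locator is not meaningful after parsing; drop it.
            locator = null;
        }
    }

    // Parses the document and returns "name@line" for each element.
    static List<String> elementLines(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        LineRecordingHandler handler = new LineRecordingHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementLines("<a>\n<b/>\n</a>"));
    }
}
```

Because the list holds copied values and not the Locator, other classes can safely read it long after the parse has completed.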

3.5.4. Getting Ahead of the Data

The characters( ) callback method accepts a character array, as well as start and length parameters, to signify which index to start at and how far to read into the array. This can cause some confusion; a common mistake is to include code like this example to read from the character array:

public void characters(char[] ch, int start, int length)
    throws SAXException {

    // Incorrect: walks the entire array, ignoring start and length
    for (int i=0; i<ch.length; i++)
        System.out.print(ch[i]);
}

The mistake here is in reading from the beginning to the end of the character array. This natural "gotcha" results from years of iterating through arrays, either in Java, C, or another language. However, in the case of a SAX event, this can cause quite a bug. SAX parsers are required to pass starting and length values on the character array that any loop constructs should use to read from the array. This allows lower-level manipulation of textual data to occur in order to optimize parser performance, such as reading data ahead of the current location as well as array reuse. This is all legal behavior within SAX, as the expectation is that a wrapping application will not try to "read past" the length parameter sent to the callback.

Mistakes as in the example shown can result in gibberish data being output to the screen or used within the wrapping application, and are almost always problematic for applications. The loop construct looks very normal and compiles without a hitch, so this gotcha can be a very tricky problem to track down. Instead, you can simply convert this data to a String, use it, and never worry:

public void characters(char[] ch, int start, int length)
    throws SAXException {

    String data = new String(ch, start, length);
    // Use the string
}
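A parser may also deliver the text of a single element in more than one characters( ) call, so a handler that needs the complete text usually accumulates the pieces. The sketch below shows one way to do that with a StringBuilder, always honoring start and length; the TextCollectorDemo and TextCollector names are hypothetical and do not come from this chapter's example code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names are illustrative only.
class TextCollectorDemo {

    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // Copy only the slice the parser reported for this event;
            // the rest of the array belongs to the parser.
            text.append(ch, start, length);
        }

        String getText() {
            return text.toString();
        }
    }

    // Parses the document and returns all of its character data.
    static String collectText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        TextCollector collector = new TextCollector();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.getText();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectText("<a>Hello, <b>SAX</b> world</a>"));
    }
}
```

Appending with the three-argument form copies the reported characters immediately, so the result is unaffected if the parser later reuses or overwrites the array.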

