You've just seen how the parts of a SAX2 application fit together, so now you're ready to see how the data is actually handled as it arrives. Here we focus on the events that deal with the core XML data model of elements, attributes, and text. To work with that model, you need to use only a handful of methods from the ContentHandler interface.
As mentioned earlier, the DefaultHandler class is a convenient way to start using SAX2 because it provides stubs for many of the handler methods. You can override those stubs with methods that do real work. Using DefaultHandler as a base class is only an implementation option; it's often just as convenient not to use such a base class. The class is used in this chapter to avoid explaining handler methods that you don't really need.
In some scenarios, Sun's JAXP requires you to use DefaultHandler as a base class. That's much more of a restriction than SAX itself makes. If you stick to using the SAX XMLReader API, as recommended in this book, you'll still have the option of using DefaultHandler as a base class, but this policy won't be imposed on your application code. For example, you can have separate objects to encapsulate policies such as error handling, so you won't need to hardwire all such policies into a single class.
Let's use this simple XML document to learn the most essential SAX callbacks:
    <stanza>
      <line>In a cavern, in a canyon,</line>
      <line>Excavating for a mine,</line>
      <line>Dwelt a miner, forty-niner,</line>
      <line>And his daughter Clementine.</line>
    </stanza>
This is a simple document, only elements and text, with no attributes, DTD, or namespaces to complicate the code we're going to write. When SAX2 parses the document, our ContentHandler implementation will see events reported for those elements and for the text. The calls will be more or less as follows; they're indented here to correspond to the XML text, and the characters() calls show strings since slices of character arrays are awkward:
    startElement ("", "", "stanza", empty)
      characters ("\n ")
      startElement ("", "", "line", empty)
        characters ("In a cavern, i")
        characters ("n a canyon,")
      endElement ("", "", "line")
      characters ("\n ")
      startElement ("", "", "line", empty)
        characters ("Excavating for a mine,")
      endElement ("", "", "line")
      characters ("\n ")
      startElement ("", "", "line", empty)
        characters ("Dwelt a miner, forty-niner,")
      endElement ("", "", "line")
      characters ("\n ")
      startElement ("", "", "line", empty)
        characters ("And his daughter")
        characters (" Clementine.")
      endElement ("", "", "line")
      characters ("\n")
    endElement ("", "", "stanza")
Notice that SAX does not guarantee that all logically consecutive characters will appear in a single characters() event callback. With this simple text, most parsers would deliver it in one chunk, but your application code can't rely on that always being done. Also, notice that the first two parameters of startElement() are empty strings; they hold namespace information, which we explain toward the end of this chapter. For now, ignore them and the last parameter, which is for the element's attributes.
For our first real work with XML, let's write code that prints only the lyrics of that song, stripping out the element markup. We'll start with the characters() method, which delivers characters in part of a character buffer with a method signature like the analogous java.io.Reader.read() method. This looks like Example 2-2.
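    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class Example extends DefaultHandler {
        public void characters (char buf [], int offset, int length)
        throws SAXException
        {
            // print() accepts a String; PrintStream has no public write(String)
            System.out.print (new String (buf, offset, length));
        }
    }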
If you create an instance of this Example class instead of DefaultHandler in Example 2-1 and then run the resulting program[9] with a URL for the XML text shown earlier, you'll see the output.
[9] On some systems, the user will need to provide a system property on the command line, passing -Dorg.xml.sax.driver=..., as shown in Section 3.2, "Bootstrapping an XMLReader" in Chapter 3, "Producing SAX2 Events".
    $ java Skeleton file:///db/sax2/verse.xml

      In a cavern, in a canyon,
      Excavating for a mine,
      Dwelt a miner, forty-niner,
      And his daughter Clementine.

    $
You'll notice some extra space. It came from the whitespace used to indent the markup! If we had a DTD, the SAX parser might well report this as "ignorable whitespace." (See Section 4.1.1, "Other ContentHandler Methods" in Chapter 4, "Consuming SAX2 Events" for information about this callback.) But we don't have one, so to get rid of that whitespace we should really print only text that's found inside of <line> elements. In this case, we can use code like Example 2-3 to avoid printing that extra whitespace; however, we'll have to add our own line ends since the input lines won't have any.
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class Example extends DefaultHandler {
        // true whenever we are outside of a <line> element
        private boolean ignore = true;

        public void startElement (String uri, String local,
                String qName, Attributes atts)
        throws SAXException
        {
            if ("line".equals (qName))
                ignore = false;
        }

        public void endElement (String uri, String local, String qName)
        throws SAXException
        {
            if ("line".equals (qName)) {
                // supply the line end that the input text lacks
                System.out.println ();
                ignore = true;
            }
        }

        public void characters (char buf [], int offset, int length)
        throws SAXException
        {
            if (ignore)
                return;
            System.out.print (new String (buf, offset, length));
        }
    }
With a more complicated content model, this particular algorithm probably wouldn't work. SAX content handlers are often written to understand particular content models and to carefully track application state within parses. They often keep a stack of open element names and attributes, along with other state that's specific to the particular task the content handler performs (such as the ignore flag in this example). A full example of an element/attribute stack is shown later, in Example 5-1.[10]
[10]Whitespace handling in text can get quite messy. XML defines an xml:space attribute that may have either of two values in a document: default, signifying that whatever your application wants to do with whitespace is fine, and preserve, which suggests that whitespace such as line breaks and indentation should be preserved. W3C XML Schemas replace default with two other options to provide a partial match for the whitespace normalization rules that apply to attribute values.
In simple cases like this, where namespaces aren't involved, you could use a particularly simple stack, as shown in Example 2-4. You can use such an element stack for many purposes. The depth of the stack corresponds to the depth of element nesting. This feature can help you debug by allowing you to structurally indent diagnostics. You can also use the stack contents to make decisions: maybe you want to print line elements that are from some stanza of a song, but not lines spoken by a character in a play. To do that, you might verify that the parent element of the line was a stanza. Make sure you understand how this example works; once you understand how startElement() and endElement() always match, as well as how they represent the document structure, you'll understand an essential part of how SAX works.
    import java.util.Stack;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class Example extends DefaultHandler {
        // names of the currently open elements, innermost on top
        private Stack<String> stack = new Stack<String> ();

        public void startElement (String uri, String local,
                String qName, Attributes atts)
        throws SAXException
        {
            stack.push (qName);
        }

        public void endElement (String uri, String local, String qName)
        throws SAXException
        {
            if ("line".equals (qName))
                System.out.println ();
            stack.pop ();
        }

        public void characters (char buf [], int offset, int length)
        throws SAXException
        {
            // print only text whose enclosing element is <line>
            if (!"line".equals (stack.peek ()))
                return;
            System.out.print (new String (buf, offset, length));
        }
    }
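The parent check described above (print a line only when it belongs to a stanza) can be sketched as follows. This is a minimal illustration, not one of the book's examples: the class name StanzaLines, the doc and speech elements, and the use of JAXP's SAXParserFactory to obtain an XMLReader are all assumptions made here, and the sketch further assumes that each line element holds only text.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps the text of <line> elements whose parent is <stanza>.
class StanzaLines extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>(); // innermost element on top
    private final StringBuilder lyrics = new StringBuilder();
    private boolean wanted; // true only while inside a <line> directly under <stanza>

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Before the push, the top of the stack is this element's parent.
        wanted = "line".equals(qName) && "stanza".equals(open.peek());
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (wanted) lyrics.append('\n'); // a wanted <line> just closed
        wanted = false;
        open.pop();
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        if (wanted) lyrics.append(buf, offset, length);
    }

    // Parses an in-memory document; checked exceptions are wrapped to keep callers short.
    static String extract(String xml) {
        StanzaLines handler = new StanzaLines();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.lyrics.toString();
    }

    public static void main(String[] args) {
        String xml = "<doc><stanza><line>Dwelt a miner,</line></stanza>"
                   + "<speech><line>Not a lyric.</line></speech></doc>";
        System.out.print(extract(xml)); // only the stanza's line is printed
    }
}
```

The stack depth is also available (open.size()) if you want to indent diagnostics structurally.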
Although they didn't appear in this simple scenario, most startElement() callbacks will have if/then/else decision trees that compare element names. Or if you're the kind of developer who likes to generalize such techniques, you can store per-element handlers in some sort of table and look them up by name. In both cases, you need to have some way to handle unexpected elements, and because of XML namespaces, the qName parameter isn't always what you should check first. One policy is just to ignore unexpected elements, which is what most HTML browsers do with unexpected tags. Another policy is to treat them as some kind of document validity error.
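A table of per-element handlers might look something like this sketch. The class name DispatchingHandler, the log list, and the "ignore unexpected elements" policy are illustrative choices, and the table is keyed by qName only because this document uses no namespaces; the parser is obtained through JAXP, which is an assumption of the sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that looks up per-element actions in a table.
class DispatchingHandler extends DefaultHandler {
    final List<String> log = new ArrayList<>();
    private final Map<String, Consumer<Attributes>> starters = new HashMap<>();

    DispatchingHandler() {
        // One entry per element name the application understands.
        starters.put("stanza", atts -> log.add("begin stanza"));
        starters.put("line", atts -> log.add("begin line"));
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        Consumer<Attributes> action = starters.get(qName);
        if (action != null) {
            action.accept(atts);
        } else {
            // Policy chosen for this sketch: note and ignore unexpected elements.
            log.add("ignored " + qName);
        }
    }

    // Parses an in-memory document and returns the log of what was dispatched.
    static List<String> run(String xml) {
        DispatchingHandler handler = new DispatchingHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.log;
    }

    public static void main(String[] args) {
        System.out.println(run("<stanza><line/><blink/></stanza>"));
    }
}
```

To treat unexpected elements as a validity error instead, the else branch could throw a SAXException (with the method declaring it).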
In the previous section, we skipped over the attributes provided with each element. Let's look at them in a bit more detail.
SAX2 wraps the attributes of an element into a single Attributes object. For any attribute, there are three things to know: its name, its value, and its type. There are two basic ways to get at the attributes: by an integer index (think "array") or by names. The only real complication is there are two kinds of attribute name, courtesy of the XML Namespaces specification.
You often need to write handler code that uses the value of a specific attribute. To do this, use code that accesses attribute values directly, using the appropriate type of name as arguments to a getValue() call. If the attribute name has a namespace URI, you'll pass the URI and the local name (as discussed later in this chapter). Otherwise you'll just pass a single argument. A value that is an empty string would be a real attribute value, but if a null value is returned, no value was known. In such a case, your application might need to infer some nonempty attribute value. (This is common for #IMPLIED attributes.)
Consider this XML element:
    <billable label='finance'
        xmlns:units="http://www.example.com/ns/units"
        units:currency="NLG"
        > 25000 </billable>
Application code might need to enforce a policy that it won't present documents with such data to users that aren't permitted to see "finance" labeled data. That might be a meaningful policy for code running in application servers where users could only access data through the server. Code to enforce that policy might look like this:
    public void startElement (String uri, String local,
            String qName, Attributes atts)
    throws SAXException
    {
        String value;

        // null means the element has no "label" attribute
        value = atts.getValue ("label");
        if ("finance".equals (value)
                && !userClearedForFinanceData (getUser ()))
            throw new SAXException ("you can't see this data");
        ... process the element
    }
Other application code might need to know the currency in which the billable amount was expressed. In this example, this information is provided using namespace-style naming, so you would use the other kind of accessor to ensure that you see the data no matter what prefix is used to identify that namespace:
    String currency;

    currency = atts.getValue ("http://www.example.com/ns/units", "currency");
    // what's the best exchange rate today?
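To see both kinds of getValue() lookup side by side, here is a small self-contained sketch. The class name BillableDemo and the "discount" attribute are made up for illustration, and the XMLReader is obtained through a namespace-aware JAXP factory, which is an assumption of this sketch. The test document deliberately uses a different prefix (u) from the one shown earlier, to show that the namespace-style lookup doesn't depend on the prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records three attribute lookups.
class BillableDemo extends DefaultHandler {
    static final String UNITS_NS = "http://www.example.com/ns/units";

    String label;    // looked up with a single name argument
    String currency; // looked up by namespace URI plus local name
    String missing;  // an attribute the element does not carry

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        label = atts.getValue("label");
        currency = atts.getValue(UNITS_NS, "currency");
        missing = atts.getValue("discount"); // null: no such attribute
    }

    // Parses an in-memory document with namespace processing enabled.
    static BillableDemo read(String xml) {
        BillableDemo handler = new BillableDemo();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        BillableDemo d = read("<billable label='finance' xmlns:u='" + UNITS_NS
                + "' u:currency='NLG'> 25000 </billable>");
        System.out.println(d.label + " " + d.currency + " " + d.missing);
    }
}
```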
There are corresponding getType() accessors, which accept both types of attribute names, but you shouldn't want to use those. After all, if you know enough about the attribute to access it by name and to process it, you should certainly know its type already!
Accessing attribute values or types using an index is faster than looking up their names. If you need to access attribute values or types more than once, consider using the appropriate one of the two getIndex() calls to get and save the index, as well as using the third syntax of the getValue() or getType() calls (shown in the next section).
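A sketch of the save-the-index approach follows; the class and field names are hypothetical, and the parser is obtained through JAXP as an assumption of the sketch. Note that getIndex() returns -1 when the element has no such attribute, so the index should be checked before it is used.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: one name lookup, then index-based access.
class IndexedLookup extends DefaultHandler {
    int labelIndex;
    int absentIndex;
    String value;
    String type;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        labelIndex = atts.getIndex("label");   // the only lookup by name
        if (labelIndex >= 0) {
            value = atts.getValue(labelIndex); // index-based accessors
            type = atts.getType(labelIndex);
        }
        absentIndex = atts.getIndex("nosuch"); // -1: attribute not present
    }

    // Parses an in-memory document and returns the handler with its results.
    static IndexedLookup read(String xml) {
        IndexedLookup handler = new IndexedLookup();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        IndexedLookup r = read("<billable label='finance'>25000</billable>");
        System.out.println(r.labelIndex + " " + r.value + " " + r.type + " " + r.absentIndex);
    }
}
```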
You might need to look at all the attributes provided with an element, particularly when you're building infrastructure components. Here's how you might use an index to iterate over all the attributes you were given in a startElement() callback and print all the important information. This code uses a few methods that we'll explain later when we discuss namespace support. getLength() works like the "length" attribute on an array.
    Attributes atts = ...;
    int length = atts.getLength ();

    for (int i = 0; i < length; i++) {
        String uri = atts.getURI (i);

        // Does this have a namespace-style name?
        if (uri.length () > 0) {
            System.out.print ("{ " + uri);
            System.out.print (" " + atts.getLocalName (i) + " }");

        // no namespace
        } else
            System.out.print (atts.getQName (i));

        // value comes from document, or is defaulted from DTD
        System.out.print (", value = " + atts.getValue (i));

        // type is CDATA unless it comes from <!ATTLIST ...> in DTD
        System.out.println (", type = " + atts.getType (i));
    }
You'll notice that accommodating input documents that use XML namespaces has complicated this code. It's important to remember that from the SAX perspective, attributes can have either of two kinds of names, and you must not use the wrong kind of name. (The same is true for elements.) Application code that handles arbitrary input documents will usually need to handle both types of names, using the logic shown earlier. It's rarely safe to assume your input documents will only use one kind of name.
It's often good practice to scan through all the attributes for an element and report some kind of validity error if a document has unexpected attributes. (These might include xmlns or xmlns:* attributes, but often it's best to just ignore those.) This can serve as a sanity check or a kind of procedural validation. For example, if you validated the input against its own DTD, that DTD might have been modified (using the internal subset or some other mechanism) so that it no longer meets your program's expectations. Such a scan over attribute values can be a good time to make sure your application does the right thing with any attributes that need to be #IMPLIED, or have type ID.
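Such a scan might be sketched like this. The class name AttributeCheck, the set of expected names, and the decision to collect complaints in a list (rather than report them through an ErrorHandler or throw) are assumptions made for the illustration; the sketch also assumes a document without namespace-style attribute names, so it compares qNames only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that notes attributes the application does not expect.
class AttributeCheck extends DefaultHandler {
    private static final Set<String> EXPECTED = new HashSet<>(Arrays.asList("label"));
    final List<String> unexpected = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            String name = atts.getQName(i);
            // Namespace declarations may or may not be reported; ignore them either way.
            if (name.equals("xmlns") || name.startsWith("xmlns:")) continue;
            if (!EXPECTED.contains(name)) unexpected.add(qName + "/@" + name);
        }
    }

    // Parses an in-memory document and returns the list of complaints.
    static List<String> check(String xml) {
        AttributeCheck handler = new AttributeCheck();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.unexpected;
    }

    public static void main(String[] args) {
        System.out.println(check("<billable label='x' colour='red' xmlns:u='urn:u'/>"));
    }
}
```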
Attribute values will always be whitespace-normalized as required by the XML specification. This means that the only whitespace in an attribute will be space characters or whitespace provided by character references to a tab, newline, or carriage return. If the type isn't reported as CDATA, additional normalization is done: leading and trailing spaces are stripped, and consecutive space characters are replaced by a single space.
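The following sketch shows those normalization rules at work. The document, the attribute names (t, c, r), and the class name are invented for the example; the internal DTD subset declares t as NMTOKENS and c as CDATA, while r is undeclared and so is treated as CDATA. It assumes a parser that reads the internal subset, as XML processors are required to do.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records each attribute's reported value and type.
class NormalizationDemo extends DefaultHandler {
    final Map<String, String> values = new HashMap<>();
    final Map<String, String> types = new HashMap<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            values.put(atts.getQName(i), atts.getValue(i));
            types.put(atts.getQName(i), atts.getType(i));
        }
    }

    // Same raw text for t and c; only the declared types differ.
    static final String DOC =
          "<!DOCTYPE e [<!ATTLIST e t NMTOKENS #IMPLIED c CDATA #IMPLIED>]>"
        + "<e t=' a\n b ' c=' a\n b ' r='x&#9;y'/>";

    // Parses an in-memory document and returns the handler with its results.
    static NormalizationDemo read(String xml) {
        NormalizationDemo handler = new NormalizationDemo();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }

    public static void main(String[] args) {
        NormalizationDemo d = read(DOC);
        // Brackets make leading and trailing spaces visible.
        for (String name : new String[] { "t", "c", "r" })
            System.out.println(name + " (" + d.types.get(name) + ") = [" + d.values.get(name) + "]");
    }
}
```

In the CDATA value, the literal newline becomes a space but nothing is trimmed or collapsed; in the NMTOKENS value, the extra spaces disappear; and the tab written as a character reference survives as a tab.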
If the parser read the DTD, you are able to see the XML attribute type it declared. The best way to see this type is to use the DeclHandler.attributeDecl() event, which needs a bit of advance planning. (This callback is discussed later in Section 4.3.1, "The DeclHandler Interface" in Chapter 4, "Consuming SAX2 Events".) Or you can use the Attributes.getType() methods if you can deal with incomplete reporting for enumerated types. (You won't see the possible values, and the type will either be NOTATION or NMTOKEN.)
The Attributes object passed to startElement() is only usable during that callback. If you need access to information found there, you must copy it. A utility AttributesImpl class is available, with a copy constructor, and is discussed in Chapter 5, "Other SAX Classes" in Section 5.1.1, "The AttributesImpl Class".
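As a brief sketch of that copying step, the handler below saves a snapshot of each element's attributes using the AttributesImpl copy constructor, so the data can still be read after the callback returns. The class name SavedAttributes and the list it keeps are illustrative, and the parser is obtained through JAXP as an assumption of the sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that keeps a private copy of every element's attributes.
class SavedAttributes extends DefaultHandler {
    final List<Attributes> saved = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Storing atts itself would be a bug: the parser may reuse that object.
        saved.add(new AttributesImpl(atts));
    }

    // Parses an in-memory document and returns the saved copies in document order.
    static List<Attributes> collect(String xml) {
        SavedAttributes handler = new SavedAttributes();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.saved;
    }

    public static void main(String[] args) {
        List<Attributes> all = collect("<a x='1'><b x='2'/></a>");
        for (Attributes atts : all)
            System.out.println("x = " + atts.getValue("x"));
    }
}
```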
The methods in the Attributes interface are summarized in Appendix A, "SAX2 API Summary". For more information, consult the SAX javadoc.
In the earlier code example, we used some callbacks without really explaining what they did and what their parameters were. This section provides more details.
In the summaries of handler callbacks presented in this book, the exception signatures (the throws clauses) are omitted. This is just for simplicity: with a single exception (ContentHandler.setDocumentLocator()), the exception signature is always the same. Every handler can throw a SAXException to terminate parsing, as well as any java.lang.RuntimeException or java.lang.Error, which any Java method can throw. Handlers can throw such exceptions directly, or as a slightly more advanced technique, they can delegate the error-handling policies to an ErrorHandler and recover cleanly if those calls return instead of throwing exceptions. (ErrorHandler is discussed later in this chapter.)
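Here is a minimal sketch of the direct approach: a handler that throws a SAXException to terminate parsing, and a caller that catches it. The class name StopEarly, the secret element, and the message text are hypothetical, and the parser is obtained through JAXP as an assumption of the sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that aborts the parse when it meets a <secret> element.
class StopEarly extends DefaultHandler {
    int seen; // number of startElement() callbacks delivered

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        seen++;
        if ("secret".equals(qName))
            throw new SAXException("element not permitted: " + qName);
    }

    // Returns null if the parse completed, else the message of the SAXException.
    static String parse(String xml, StopEarly handler) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StopEarly handler = new StopEarly();
        String problem = parse("<a><secret/><b/></a>", handler);
        System.out.println(problem + " (after " + handler.seen + " elements)");
    }
}
```

Once the exception is thrown, no further callbacks are delivered, so the b element is never reported.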
The ContentHandler callbacks include:
void startElement(String uri, String local, String qName, Attributes atts)
void endElement(String uri, String local, String qName)
These two callbacks bracket element content, starting with startElement() to identify the element and provide its attributes. Typically, startElement() will be followed by a series of other event callbacks to report child content, such as character data and other elements. After all children of the element have been reported, endElement() reports the end of the element.
String uri: For elements associated with a namespace URI, this is the URI. For other kinds of elements, this is the empty string.
String local: For elements associated with a namespace URI, this is the element name with any prefix removed. For other kinds of elements, this is the empty string.
String qName: This is the element name as found in the XML text, but for elements associated with a namespace URI, this might be the empty string. (Don't rely on it being nonempty unless the URI is empty, or you've configured the parser in "mixed" namespace reporting mode as described later in this chapter, in Section 2.6.3, "Namespace Feature Flags".)
Attributes atts: An element's attributes are only provided in the startElement() call. The atts object is owned by the parser and is only on short-term loan to the event callback. If your application code needs to save attribute data, it must make a copy. (The AttributesImpl helper class may help.)
These callbacks appear in pairs unless an exception is thrown to abort parsing. Even empty elements (like <this/>) cause two calls.
Most applications do a lot of work in startElement() callbacks to set up further processing, but endElement() work varies. Sometimes endElement() does nothing, sometimes it's just a quick state cleanup (popping stacks), and sometimes it's where all the work queued during an element's processing is finally performed.
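The pairing of startElement() and endElement() calls, including the two calls made for an empty element, can be observed with a small logging handler. This is only a sketch; the class name PairedEvents is hypothetical, and the parser is obtained through JAXP as an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records element events in the order they arrive.
class PairedEvents extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        events.add("start " + qName);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add("end " + qName);
    }

    // Parses an in-memory document and returns the recorded events.
    static List<String> trace(String xml) {
        PairedEvents handler = new PairedEvents();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        // <this/> is an empty element, yet it produces a start and an end event.
        System.out.println(trace("<a><this/></a>"));
    }
}
```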
void characters(char buf[], int offset, int length)
Text content is provided as a range from a character array. Applications will often need to make a copy of this data, appending it either to another character array or to a StringBuffer. (Use strings if their extra cost is not a problem.) Then the "real action" to process character data would be taken when the application learns that all the relevant characters have been provided, often because of a startElement() or endElement() call.
char buf[]: A character array holding the text being provided. You must ignore characters in this buffer that are outside of the specified range.
int offset: The index of the first character from the buffer that is in range.
int length: The number of text characters in the range, beginning at the specified offset in the buffer.
Application code must expect multiple sequential calls to this method. For example, it would be legal (but slow) for a parser to issue one callback per character. Content found in different external entities will be reported in different characters() invocations so location information is reported correctly. (This is described in Section 4.1.2, "The Locator Interface" in Chapter 4, "Consuming SAX2 Events".) Most parsers have only a limited amount of buffer space and will flush characters whenever the buffer fills; flushing can improve performance because it eliminates the need for extra buffer copies. Excess buffer copying is a classic performance killer in all I/O-intensive software.
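One way to cope with text arriving in several pieces is to append each piece to a buffer and act only when endElement() shows that the text is complete. The sketch below does this for the line elements of the song; the class name LineCollector is hypothetical, the parser is obtained through JAXP as an assumption, and the sketch assumes line elements are not nested.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that assembles each <line> from however many chunks arrive.
class LineCollector extends DefaultHandler {
    final List<String> lines = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <line>

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("line".equals(qName)) current = new StringBuilder();
    }

    @Override
    public void characters(char[] buf, int offset, int length) {
        // Copy only the reported range; the rest of buf belongs to the parser.
        if (current != null) current.append(buf, offset, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("line".equals(qName)) {
            lines.add(current.toString()); // all chunks have now been delivered
            current = null;
        }
    }

    // Parses an in-memory document and returns the completed lines.
    static List<String> collect(String xml) {
        LineCollector handler = new LineCollector();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.lines;
    }

    public static void main(String[] args) {
        String xml = "<stanza>\n  <line>In a cavern, in a canyon,</line>\n"
                   + "  <line>Excavating for a mine,</line>\n</stanza>";
        for (String line : collect(xml)) System.out.println(line);
    }
}
```

The result is the same whether the parser delivers each line in one characters() call or in many.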
The XML specification guarantees that you won't see CRLF- or CR-style line ends here. All the line ends from the document will use single newline characters ("\n"). However, some perverse documents might have placed character references to carriage returns into their text; if you see them, be aware that they're not real line ends!
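The line-end guarantee can be checked with a few lines of code. In this sketch (class name and test text invented for the purpose, parser obtained through JAXP as an assumption), the input contains a CRLF pair, a bare CR, and a carriage return written as the character reference &#13;. Only the last one should reach the application as a carriage return.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that gathers all character data in the document.
class LineEnds extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] buf, int offset, int length) {
        text.append(buf, offset, length);
    }

    // Parses an in-memory document and returns its text as the parser reported it.
    static String textOf(String xml) {
        LineEnds handler = new LineEnds();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        String reported = textOf("<t>a\r\nb\rc&#13;d</t>");
        // Show the control characters visibly instead of printing them raw.
        System.out.println(reported.replace("\n", "\\n").replace("\r", "\\r"));
    }
}
```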
There are many other methods in the ContentHandler interface, discussed later in Section 4.1.1, "Other ContentHandler Methods" in Chapter 4, "Consuming SAX2 Events".
Copyright © 2002 O'Reilly & Associates. All rights reserved.