More About ContentHandler
The LexicalHandler Interface
Exposing DTD Information
Turning SAX Events into Data Structures
XML Pipelines
Most of the power of SAX is exposed through event callbacks. In previous chapters you've seen some of the most widely used event callbacks as well as how to ensure that all the callbacks are generated and reported to application code.
This chapter presents the rest of the standard SAX event-handling interfaces (including the extension handlers), then talks about some of the common ways that event consumers use those interfaces. These interfaces are primarily implemented by application code that consumes events and needs to solve particular problems. You might also write custom event producers, which call these interfaces directly rather than expecting some type of XMLReader to issue them.
In Section 2.3, "Basic ContentHandler Events", in Chapter 2, "Introducing SAX2", we looked at the most important APIs used to handle XML document content. Some other APIs were deferred to this section because they aren't used as widely. Depending on what problems you're solving, you may rely heavily on some of these additional methods.
Five ContentHandler callbacks were discussed in Chapter 2: Section 2.3.4, "Essential ContentHandler Callbacks" explained how characters and element boundaries were reported, and Section 2.6.4, "ContentHandler and Prefix Mappings" explained how namespace-prefix scopes were reported. But the interface has six other methods. Here's what they do and when you'll want to use them:
setDocumentLocator(Locator locator)
This is normally the first callback from a parser; the single parameter is a Locator, discussed later. Strictly speaking, SAX parsers are not required to provide a locator or to make this callback; however, you'd want to avoid parsers that don't provide this information. Your implementation of this callback will normally just save the locator; it can't do much more, since this is the only SAX event callback that can't throw a SAXException:
class MyHandler implements ContentHandler ... {
    private Locator locator;
    ...
    public void setDocumentLocator (Locator l)
        { locator = l; }
    ...
}
Use this object as discussed later in this chapter, in Section 4.1.2, "The Locator Interface ". It is the standard way to report the base URI of the XML text currently being parsed; that information is essential for resolving relative URIs. It's also essential for diagnostics that tell you where application code detects errors in large quantities of XML text.
startDocument(), endDocument()
These two callbacks bracket processing for a document, and they are normally used to manage application state associated with the document being parsed. If you're parsing a document, these methods will always be called once each, even when parsing is cut short by a thrown exception. No other methods have such guarantees.
startDocument() is always called before any data is reported from the parser, and is normally used to initialize application data structures. It will usually be the second callback from the parser; parsers that provide a Locator will report that first. You can't rely on a setDocumentLocator() call before startDocument(); structure your initialization code to do the real work in the callback guaranteed to be available.
endDocument() is always called to report that no more document data will be provided. The normal application response is to clean up all state associated with the current parse. The parser closes any input data streams you gave it using an InputSource (discussed later), so the application doesn't need to do that. Cleanup would include forgetting any saved Locator since that object is no longer usable when the parse is complete. Also, you'd likely close other files or sockets that were opened while processing this document:
class MyHandler implements ContentHandler ... {
    ...
    public void startDocument ()
    throws SAXException
    {
        // initialize data structures for ALL handlers here
        ...
    }

    public void endDocument ()
    throws SAXException
    {
        // free those same data structures
        locator = null;
        elementStack = null;
        ...
    }
    ...
}
These two calls are widely used in robust SAX code because they provide such good hooks to control memory usage and manage associated file descriptors. However, some SAX2 parsers have a bug that reduces the robustness offered by SAX; they won't correctly call endDocument() when parsing is aborted by throwing exceptions.
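One defensive pattern, given that bug, is to stop relying solely on endDocument() and put per-parse cleanup in a try/finally around the parse call itself. The sketch below illustrates the idea; the SafeParse class name and its static flag are assumptions for this example, not part of SAX:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: guarantee cleanup with try/finally, so state is released
// even when a buggy parser skips the endDocument() callback after
// an exception aborts parsing.
public class SafeParse {
    static boolean cleanedUp = false;

    public static void parse (String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance ()
                .newSAXParser ().getXMLReader ();
        reader.setContentHandler (new DefaultHandler ());
        try {
            reader.parse (new InputSource (new StringReader (xml)));
        } finally {
            // release per-parse state here, whether or not the
            // parser delivered endDocument()
            cleanedUp = true;
        }
    }
}
```

Even if the parser aborts on malformed input without reporting endDocument(), the finally clause still runs, so file descriptors and saved references get released.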
processingInstruction(String target, String data)
Processing Instructions (PIs) are used in XML for data that doesn't obey the rules of a DTD. They can be placed anywhere in a document, including within the DTD, except inside other markup constructs like tags. Unlike comments, PIs are designed for applications to use. They're part of the document structure that programmatic logic must understand; they can follow rules, just not ones found in a DTD or schema. This method has two parameters:
target
XML applications use this parameter to determine how to handle the PI. You can rely on the fact that it'll never be the string xml (in any combination of upper- and lowercase characters), because XML and text declarations are not processing instructions.
Some documents follow the convention that the target of a PI names a notation (perhaps the fully qualified URI found in its system identifier) and the meaning is associated with the notation rather than the name. That's a fine practice to follow, but it isn't essential. Most code just compares target names as strings, rather than using data reported with DTDHandler.notationDecl() to figure out what a target name should mean.
data
This parameter is the data associated with the PI; it may be null if no data was provided after the target name. Some applications use attribute-like syntax here; others don't bother.
Processing instructions are natural to use in template systems and other document-oriented applications.[19]
[19]For example, the syntax of PHP, the web page scripting tool, looks like a processing instruction, <?php ...?>. For various reasons, PHP is not actually an XML document syntax.
Processing instructions are normally safe to ignore when your processing doesn't recognize them (passing them on to any subsequent processing stage), or to store. When an application does recognize a PI, it normally acts on it immediately. For example, an <?xml-stylesheet ...?> PI might select a particular XSLT stylesheet to use for generating a servlet's output. The processing instruction event is used later, in Example 6-9.
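PI data often uses that attribute-like syntax, as xml-stylesheet does. A handler might pull out such pseudo-attributes with a small helper like the following sketch; the PseudoAttributes class is a hypothetical illustration, not part of SAX:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extracting a pseudo-attribute (name="value" or name='value')
// from PI data such as:  type="text/xsl" href="style.xsl"
// A ContentHandler could call this from processingInstruction() after
// checking that the target is one it recognizes, e.g. "xml-stylesheet".
public class PseudoAttributes {
    public static String get (String data, String name) {
        if (data == null)
            return null;
        // match name="value" or name='value', quoting the name in case
        // it contains regex metacharacters
        Pattern p = Pattern.compile (
                Pattern.quote (name) + "\\s*=\\s*([\"'])(.*?)\\1");
        Matcher m = p.matcher (data);
        return m.find () ? m.group (2) : null;
    }
}
```

For the data string from an <?xml-stylesheet type="text/xsl" href="style.xsl"?> PI, asking for "href" yields "style.xsl", and asking for an absent name yields null.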
ignorableWhitespace(char[] ch, int start, int length)
This is an optional callback, made by most parsers (including all that are validating) to report whitespace that separates elements in element content models, like those of the form (title,para*,sect1*) but not (#PCDATA|para|comment)*, ANY, or EMPTY. Whitespace before or after the document's root element is not treated as ignorable and is completely discarded. Providing this information is a requirement of the XML specification, since this kind of whitespace is defined to be markup rather than document content. If the parser doesn't see such a content model declaration for any reason, it can't use this callback; it'll use characters() instead, and applications will need to figure out if the whitespace is part of markup or part of content.
The parameters are exactly the same as those of the characters() callback, except that you know the characters in the specified range will all be spaces, tabs, or newlines. (Keep that in mind if you're directly producing ignorable whitespace to feed some event consumer. Using CRLF- or CR-style line ends here is a bug, though you might not see immediate consequences.) Like characters(), this method can be called several times in a row, to complete processing a single stretch of characters.
There are two popular ways to handle this callback. My favorite is to drop all the characters; they're only in the source document to make the elements lay out nicely, so they won't ever mean anything. There's rarely a reason to even look at the data, much less save it. The other option is to delegate handling and just call the characters() callback with the whitespace.
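The second, delegating option can be sketched in a few lines; the DelegatingHandler class here is an illustrative assumption, and a real handler would do more in characters() than accumulate text:

```java
import org.xml.sax.helpers.DefaultHandler;

// Sketch: delegate ignorable whitespace to characters(), so downstream
// logic sees one uniform stream of character data.
public class DelegatingHandler extends DefaultHandler {
    final StringBuilder text = new StringBuilder ();

    public void characters (char[] ch, int start, int length) {
        text.append (ch, start, length);
    }

    public void ignorableWhitespace (char[] ch, int start, int length) {
        // treat the markup whitespace like ordinary content
        characters (ch, start, length);
    }
}
```

The first option, dropping the data, is even simpler: leave ignorableWhitespace() with an empty body (which is what DefaultHandler already does).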
skippedEntity(String name)
The parameter is a String that identifies an internal or external parsed entity. General entity names are presented as found in their declarations (dudley). Parameter entity names begin with a percent sign (%nell). The external DTD subset is special; it's an unnamed parameter entity and is reported with the name [dtd]. You might not be able to tell if the skipped entity was an internal or external entity, even using DeclHandler events.
You probably don't ever want to see this call, since it means that part of your document has been hidden. XML 1.0 processors are required to report this case; SAX 1.0 didn't, and most other parser-level APIs (such as DOM Level 2) still don't. This is a call that only nonvalidating parsers may issue, and even then only if they are not parsing all the external entities referred to in documents -- that is, where one or both of the external-entity feature flags is set to false, to disable reading external general or parameter entities. No widely used Java parsers clear those flags by default, so this is a rare call in Java. However, some C parsers, such as Expat (used in Mozilla), won't normally parse external entities, so the notion isn't exotic in all languages.
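If you do need to handle this callback, a reasonable response is to record what was hidden so the application can warn about it later. This sketch applies the naming rules described above; the SkipRecorder class is an assumption for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: record skipped entities so the application can report that
// parts of the document were hidden. Parameter entity names arrive
// with a leading "%", and the external DTD subset as "[dtd]".
public class SkipRecorder extends DefaultHandler {
    final List<String> skipped = new ArrayList<String> ();

    public void skippedEntity (String name) {
        if ("[dtd]".equals (name))
            skipped.add ("external DTD subset");
        else if (name.startsWith ("%"))
            skipped.add ("parameter entity " + name.substring (1));
        else
            skipped.add ("general entity " + name);
    }
}
```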
This useful interface is sometimes overlooked. It gives information that is essential for providing location-sensitive diagnostics and is often given to SAXParseException constructors. That same information is also needed to resolve relative URIs in document content or attribute values (such as xml:base). Parsers provide one Locator instance, which can be used inside event callbacks to find what entity triggered the event and approximately where. Use that locator only during such callbacks. There are only a few methods in this interface.
getSystemId()
This is the most important method in this interface. It returns the base URI (system ID) for the entity being parsed; this is always an absolute URI. (However, versions of Xerces that are current at this writing have a bug here. They sometimes return nonabsolute URIs.) Use this method to identify the document or external entity in diagnostics or to resolve relative URIs (perhaps in conjunction with xml:base attributes).
If the parser doesn't know this value, null is returned. This normally indicates that the parser was not given such a URI in the InputSource encapsulating document text. That's bad practice except when it's unavoidable, such as parsing in-memory data or input to the POST method in a servlet.
getLineNumber(), getColumnNumber()
These two methods approximate the current position of a parser within an entity. The position reflected is where the relevant event's data ended. It is only an approximation for diagnostics, but most parsers do try to be accurate about the line number.
These numbers count up from 1 as appropriate for user-oriented diagnostics. Not all implementations will provide these values; the value -1 is returned to indicate that no value was provided.
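Putting these methods together, a diagnostic formatter should tolerate the "not provided" cases (-1 positions, null system ID). The Diagnostics class below is an illustrative assumption, not a SAX API:

```java
import org.xml.sax.Locator;

// Sketch: format locator data for user-oriented diagnostics, handling
// the cases where a parser provided no URI or no position.
public class Diagnostics {
    public static String where (Locator l) {
        if (l == null)
            return "(no location available)";
        String uri = (l.getSystemId () != null)
                ? l.getSystemId () : "(unknown source)";
        StringBuilder sb = new StringBuilder (uri);
        if (l.getLineNumber () != -1) {
            sb.append (", line ").append (l.getLineNumber ());
            if (l.getColumnNumber () != -1)
                sb.append (", column ").append (l.getColumnNumber ());
        }
        return sb.toString ();
    }
}
```

Passing the locator saved in setDocumentLocator() produces strings like "http://example.com/doc.xml, line 10, column 5", suitable for error logs.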
getPublicId()
This method returns the public identifier of the entity being parsed, if one is available; otherwise it returns null. This may be useful for diagnostics in some cases.
One common use for a locator is to report an error detected while an application processes document content. The SAXParseException class has two constructors that take locator parameters. (The descriptive string is always first, the locator is second, and an optional "root cause" exception is third.) Once you create such an exception, it can be thrown directly, which always terminates a parse. Or you can pass it to an ErrorHandler to centralize error-handling policy in your application:
// "locator" was saved when setDocumentLocator() was called earlier
// or was initialized to null; this is safe in both cases
try {
    ...
    engine.setWarpFactor (11);
    ...
} catch (DriveException e) {
    SAXParseException spe = new SAXParseException (
            "The warp engine's gonna blow!", locator, e);

    errHandler.error (spe);
    // we'll get here whenever such problems are ignored
}
To resolve relative URIs in document content -- for example, one found in an <xhtml:a href="..."/> reference in a link checker -- you'd use code like this (ignoring xml:base complications):
public void startElement (String uri, String lname,
    String qname, Attributes atts)
throws SAXException
{
    if (xhtmlURI.equals (uri)) {
        if ("a".equals (lname)) {
            String href = atts.getValue ("href");

            if (href != null) {
                // ASSUMES: locator is nonnull
                System.out.println ("Found href to: "
                    + new URI (locator.getSystemId ())
                        .resolve (href));
            }
            // else presumably <xhtml:a name="..."/>
        }
    }
    ...
}
Some of the XMLReader implementations cannot possibly call ContentHandler.setDocumentLocator() with a Locator. When parsing in-memory data structures, such as a DOM document, a locator will normally be meaningless. When parsing in-memory buffers like a String (with a StringReader), there won't usually be a URI in the locator.
If your application supports the layered xml:base convention (which lets documents "lie" about their true locations for purposes of resolving relative URIs), it will need to track those attributes itself, as part of a context stack mechanism. (An example of such a stack is shown later, in Example 5-1.) Such attributes can sometimes help make up for SAX event sources that can't provide locator information, such as DOM-to-SAX producers. But they can confuse things too: in the following example, xml:base would apply to the top element and its direct children, but nothing within the external entity reference. (Let's assume, for the sake of discussion, that no element has an xml:base attribute.)
<top xml:base="http://www.example.com/moved/doc2.xml">
    <xhtml:a href="abc.xml"/>
    <xhtml:div>
        &external;
    </xhtml:div>
    <xhtml:a href="xyz.xml"/>
</top>
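The core of such a context stack mechanism can be sketched in a few lines. This is a minimal illustration under stated assumptions (the BaseStack name and shape are made up for this example; it ignores the external-entity complications just described): push in startElement(), pop in endElement(), and resolve against the top of the stack.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: a stack of base URIs. Each element pushes the base URI in
// effect for its content, resolving any xml:base attribute against
// the enclosing base.
public class BaseStack {
    private final Deque<URI> stack = new ArrayDeque<URI> ();

    public BaseStack (String documentBase) {
        stack.push (URI.create (documentBase));
    }

    // call from startElement(); pass null if there's no xml:base attribute
    public void push (String xmlBase) {
        URI current = stack.peek ();
        stack.push (xmlBase == null ? current : current.resolve (xmlBase));
    }

    // call from endElement()
    public void pop () { stack.pop (); }

    public String resolve (String relative) {
        return stack.peek ().resolve (relative).toString ();
    }
}
```

With the document above, href="abc.xml" on the top element's child would resolve against the xml:base value rather than the document's real location.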
When character content of an element is reported, characters from different external entities will get different callbacks, so the locator can be used to tell those different entities apart from each other.
One of the goals of XML was to bring Unicode into widespread use so that the Web could really become worldwide in terms of people, not just technology. This brings several concerns into text management. You may not need to worry about these if you're working only in ASCII or with just one character encoding. While you're just starting out with Java and XML, you should certainly avoid worrying about these details. Other users of SAX2 will need to understand these issues. Since they surface primarily with ContentHandler event callbacks, we briefly summarize them here.
If your application works with MathML, or in various languages whose character sets gained support in Unicode 3.1 through the so-called Astral Planes, you will need to know that what Java calls a char is not really the same thing as a Unicode character or an XML character. If you aren't using such languages, you'll probably be able to ignore this issue for a while. Still, you might want to read about Unicode 3.1 to learn more about this and minimize trouble later. By the time you read this, the W3C may even have completed its "Blueberry" XML update, intended to allow the use of some such characters within XML names.
In the case of such characters, whose Unicode code point is above the value U+FFFF (the maximum 16-bit code point), these characters are mapped to two Java char values, called a surrogate pair. The char values are in a range reserved for surrogate characters, with a high surrogate always immediately followed by a low surrogate. (This is called a big-endian sequence.) Surrogate pairs can show up in several places in XML, and hence in SAX2: in character content, processing instructions, attribute values (including defaults in the DTD), and comments.
At this time, Java does not have APIs to explicitly support characters using surrogate pairs, although character arrays and java.lang.String will hold them as if the char values weren't part of the same character. The java.lang.Character class doesn't recognize surrogate pairs. The best precaution seems to be to prefer APIs that talk in terms of slices of character arrays (or Strings), rather than in terms of individual Java char values. This approach also handles other situations where more than one char value is needed per character.
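A slice-oriented helper along those lines might count XML characters in a char array by pairing surrogates manually. This is an illustrative sketch (the CodePoints name is hypothetical), using the surrogate ranges defined by Unicode:

```java
// Sketch: count XML characters (Unicode code points) in a slice of a
// char array, treating a high surrogate (U+D800..U+DBFF) followed by
// a low surrogate (U+DC00..U+DFFF) as a single character.
public class CodePoints {
    public static int count (char[] ch, int start, int length) {
        int n = 0;
        for (int i = start; i < start + length; i++) {
            n++;
            if (ch [i] >= '\uD800' && ch [i] <= '\uDBFF'
                    && i + 1 < start + length
                    && ch [i + 1] >= '\uDC00' && ch [i + 1] <= '\uDFFF')
                i++;    // the pair is one character, two chars
        }
        return n;
    }
}
```

Such a method fits naturally inside a characters() callback, which already reports text as slices of a char array.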
Depending on the character encodings you're using and the applications you're implementing, you may also need to pay attention to the W3C Character Model (http://www.w3.org/TR/charmod/ at this writing) and Unicode Normalization Form C. Briefly, these aim to eliminate undesirable representations of characters and to handle some other cases where Unicode characters aren't the same as XML characters or a Java char, such as composite characters. For example, many accented characters are represented by composing two or more Unicode characters. Systems work better when they only need to handle one way to represent such characters, and Form C addresses that problem.
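Java releases after this book's era added a standard API for this, java.text.Normalizer. As a brief sketch of the composite-character point (the NfcExample class name is an assumption): "e" followed by a combining acute accent (U+0301) composes under Form C into the single precomposed character U+00E9.

```java
import java.text.Normalizer;

// Sketch: produce Unicode Normalization Form C, so that decomposed
// sequences like "e" + combining acute accent become the single
// precomposed character (here, U+00E9).
public class NfcExample {
    public static String toFormC (String s) {
        return Normalizer.normalize (s, Normalizer.Form.NFC);
    }
}
```

Normalizing text to Form C before comparison means two spellings of the same accented character can't silently compare as unequal.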
Copyright © 2002 O'Reilly & Associates. All rights reserved.