This chapter briefly explains the most popular programming models for parsing and manipulating XML data in use today. XML processing includes a diverse set of tools, which require different approaches but offer distinct advantages and disadvantages.
TIP: XML processors of all kinds are available in a wide variety of languages, including C, C#, C++, COBOL, Haskell, Java, JavaScript (ECMAScript/JScript), Pascal, Perl, Python, Ruby, Smalltalk, Tcl, and Visual Basic. If you can't find XML support built into your programming environment, a quick search will likely locate a library. XML.com maintains a list of XML resources at http://www.xml.com/resourceguide/ that may be a good place to start.
XML's structured and labeled text can be processed by developers in several ways. Programs can look at XML as text, as a stream of events, as a tree, or as a serialization of some other structure. Tools supporting all of these options are widely available.
At their foundation, XML documents are text. The content and markup are both represented as text, and text-editing tools can be extremely useful for XML document inspection, creation, and modification. XML's textual foundations make it possible for developers to work with XML directly, using XML-specific tools only when they choose.
Despite this textual nature, however, XML presents some serious limitations for programs that attempt to process XML documents as text documents. It is possible to process extremely simple XML documents reliably using basic textual tools like regular expressions, but this becomes much more difficult as features such as attribute defaulting, entity processing, and namespaces are added to documents. Using these features is extremely difficult when treating a document purely as text.
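The limitation described above can be seen in a small sketch. It assumes Python's standard re and xml.etree.ElementTree modules; the document, the surname entity, and both function names are made up for illustration. A regular expression sees only the raw entity reference, while a parser reports the entity's replacement text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document: the family name is supplied by an internal entity.
DOC = """<!DOCTYPE name [
  <!ENTITY surname "Johnson">
]>
<name><given>Keith</given><family>&surname;</family></name>"""


def family_by_regex(text):
    """Treat the document purely as text; entities are not resolved."""
    match = re.search(r"<family>(.*?)</family>", text)
    return match.group(1) if match else None


def family_by_parser(text):
    """Let an XML parser resolve the entity before looking at content."""
    return ET.fromstring(text).findtext("family")


print(family_by_regex(DOC))   # &surname;
print(family_by_parser(DOC))  # Johnson
```

The regular expression is not wrong so much as blind: it has no way of knowing what the DTD says about &surname;.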
Textual tools are a key part of the XML toolset, however. Many developers use text editors such as vi, Emacs, Notepad, WordPad, BBEdit, and UltraEdit to create or modify XML documents. Regular expressions -- in environments such as sed, grep, Perl, and Python -- can be used for search and replace or for tweaking documents prior to XML parsing or XSLT processing. These tools can also be very useful for searching and querying the information in XML documents, even without an understanding of the surrounding structure.
Textual tools may also be applied to the results of an XML parse, working on the document after its XML-specific nature has been resolved. The W3C's XML Schema, for instance, includes regular-expression matching as one mechanism for validating data types, as discussed in Chapter 16. A smart search and replace or spell checker might process only the contents of elements (and perhaps attributes), not the markup that defines the structures.
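A "smart" search and replace of this kind might look something like the following sketch, which assumes Python's standard xml.etree.ElementTree module. The function name replace_in_text and the sample document are illustrative only. The regular expression is applied to character data after parsing, so element names that happen to match the pattern are left alone.

```python
import re
import xml.etree.ElementTree as ET


def replace_in_text(xml_text, pattern, replacement):
    """Apply a regular expression to element content only, never to markup."""
    root = ET.fromstring(xml_text)
    regex = re.compile(pattern)
    for element in root.iter():
        # .text is the character data right after the start tag.
        if element.text:
            element.text = regex.sub(replacement, element.text)
        # .tail is the character data that follows the element's end tag.
        if element.tail:
            element.tail = regex.sub(replacement, element.tail)
    return ET.tostring(root, encoding="unicode")


DOC = "<name><given>name: Keith</given><family>Johnson</family></name>"
print(replace_in_text(DOC, r"name", "NAME"))
```

Here the <name> element keeps its name, while the word "name" inside the content of <given> is replaced.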
Text-based processing can be performed in conjunction with other XML processing. Parsing and then reserializing XML documents after other processing has taken place doesn't always produce the desired results. XSLT, for instance, will remove entity references and replace them with entity content. Preserving entities requires replacing them in the original document with unique placeholders, and then replacing the placeholder as it appears in the result. With regular expressions, this is quite easy to do. Developers may also need to replace particular characters with references to images; this approach can be very useful where an obscure or nonstandard glyph is needed in XHTML.
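The placeholder technique for preserving entities could be sketched as follows, using Python's standard re module. The [[ENTITY:name]] placeholder format and the function names are assumptions made for this example; any marker that cannot otherwise occur in the document would do. The five predefined XML entities are left alone, since a parser and serializer handle those correctly on their own.

```python
import re

# Matches general entity references such as &copyright; (not &#160;).
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")
# Matches the hypothetical placeholder format used by this sketch.
PLACEHOLDER = re.compile(r"\[\[ENTITY:([A-Za-z_][\w.-]*)\]\]")
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def protect_entities(text):
    """Swap entity references for placeholders before parsing."""
    def to_placeholder(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)  # leave &amp;, &lt;, etc. untouched
        return "[[ENTITY:%s]]" % name
    return ENTITY_REF.sub(to_placeholder, text)


def restore_entities(text):
    """Turn placeholders back into entity references in the result."""
    return PLACEHOLDER.sub(r"&\1;", text)


source = "<p>&copyright; 2002 &amp; later</p>"
protected = protect_entities(source)
print(protected)
print(restore_entities(protected))
```

The protected text would be handed to the parser or XSLT processor, and restore_entities would be run over the serialized output.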
WARNING: XML's dependence on Unicode means that developers need to be careful about the text-processing tools they choose. Many development environments have been upgraded to support Unicode, but there are still tools available that don't. Before using text-processing tools on the results of an XML parse, make sure they support Unicode. Text-processing tools being applied to raw XML documents must support the character encoding used for the document.
As an XML parser reads a document, it moves from the beginning of the document to the end. It may pause to retrieve external resources--for a DTD or an external entity, for instance--but it builds an understanding of the document as it moves along. Enforcing well-formedness and validity constraints and applying namespaces requires keeping track of context; applying attribute defaults and entities requires keeping a list of appropriate content to insert; but the end result is a complete "reading" of the XML document.
Event-based parsers report this reading as it happens, in a stream of events representing the information in the document. The "events" are, for example, the start of an element, the content of an element, and the end of an element. For example, given this document:
<name><given>Keith</given><family>Johnson</family></name>
an event-based parser might report events such as this:
startElement:name
startElement:given
content:Keith
endElement:given
startElement:family
content:Johnson
endElement:family
endElement:name
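An event listing like the one above can be produced with a few lines of code. This is a minimal sketch using Python's standard xml.sax module; the EventLogger class and list_events function are names invented for the example. Note that a parser is free to deliver one run of text as several content events, so real handlers should not assume one event per text node.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record each parse event as a short string."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("startElement:" + name)

    def characters(self, content):
        self.events.append("content:" + content)

    def endElement(self, name):
        self.events.append("endElement:" + name)


def list_events(xml_text):
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


DOC = "<name><given>Keith</given><family>Johnson</family></name>"
for event in list_events(DOC):
    print(event)
```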
The list and structure of events can become much more complex as features such as namespaces, attributes, whitespace between elements, comments, processing instructions, and entities are added, but the basic mechanism is quite simple and generally very efficient.
Event-based parsers only have to keep track of a limited amount of information. They need to understand the contents of DTDs (and possibly schemas), if the documents use them, and they need to maintain context stacks for element names and namespace declarations. They don't need to build a complete record of the document as they parse it, which minimizes the amount of memory needed for the parse.
Event-based parsers require the consumer of the events to do a lot more work, however. Processing events typically means the creation of a state machine, i.e., code that understands current context and can route the information in the events to the proper consumer. Because events occur as the document is read, applications must be prepared to discard results should a fatal error occur partway through the document. Applications can't depend on information that occurs later in a document to interpret the current event, either, making it hard to use some kinds of XPath expressions, for instance, in an event-based environment. These factors can make it difficult to work directly with event-based parsers.
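To give a sense of the extra work involved, here is a small sketch of such a state machine, again assuming Python's standard xml.sax module. The NameHandler class and read_name function are hypothetical. The handler keeps a stack of open elements as its context, routes text to the right field when an element closes, and throws away partial results if the parser reports a fatal error.

```python
import xml.sax


class NameHandler(xml.sax.ContentHandler):
    """Collect given and family names, using the element stack as context."""

    def __init__(self):
        super().__init__()
        self.path = []    # names of the currently open elements
        self.buffer = []  # text seen since the most recent start tag
        self.result = {}

    def startElement(self, name, attrs):
        self.path.append(name)
        self.buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        self.buffer.append(content)

    def endElement(self, name):
        if self.path == ["name", "given"]:
            self.result["given"] = "".join(self.buffer)
        elif self.path == ["name", "family"]:
            self.result["family"] = "".join(self.buffer)
        self.path.pop()
        self.buffer = []


def read_name(xml_text):
    handler = NameHandler()
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), handler)
    except xml.sax.SAXParseException:
        return None  # fatal error: discard whatever was collected so far
    return handler.result


print(read_name("<name><given>Keith</given><family>Johnson</family></name>"))
print(read_name("<name><given>Keith</given><family>Johnson</name>"))
```

In the second call, the given name has already been reported by the time the mismatched end tag is found, which is why the caller returns None rather than the half-filled dictionary.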
Despite the potential difficulty, event-based parsers are very useful for a wide variety of tasks. Filters can process and modify events before passing them to another processor, efficiently performing a wide range of transformations. Filters can be stacked, providing a relatively simple means of building XML processing pipelines, where the information from one processor flows directly into another. Applications that want to feed information directly from XML documents into their own internal structures may find events to be the most efficient means of doing that. Even parsers that report XML documents as complete trees, as described in the next section, typically build those trees from a stream of events.
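Stacked filters of the kind just described might be sketched like this, using XMLFilterBase and XMLGenerator from Python's standard xml.sax.saxutils module. The two filter classes, the run_pipeline function, and the choice of renaming family to surname are all invented for illustration. Each filter changes only the events it cares about and passes everything else along unchanged.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameFilter(XMLFilterBase):
    """Rename one element type as its events pass through."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self.old, self.new = old, new

    def _map(self, name):
        return self.new if name == self.old else name

    def startElement(self, name, attrs):
        super().startElement(self._map(name), attrs)

    def endElement(self, name):
        super().endElement(self._map(name))


class UppercaseFilter(XMLFilterBase):
    """Convert all character data to uppercase."""

    def characters(self, content):
        super().characters(content.upper())


def run_pipeline(xml_text):
    out = io.StringIO()
    reader = xml.sax.make_parser()
    # Events flow: reader -> RenameFilter -> UppercaseFilter -> XMLGenerator
    pipeline = UppercaseFilter(RenameFilter(reader, "family", "surname"))
    pipeline.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    pipeline.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


DOC = "<name><given>Keith</given><family>Johnson</family></name>"
print(run_pipeline(DOC))
```

Neither filter ever holds more than one event at a time, so the pipeline's memory use stays flat no matter how long the document is.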
TIP: The Simple API for XML (SAX), described in Chapter 19 and Chapter 25, is the most commonly used event-based API. SAX2, the current version of SAX, is hosted at http://www.saxproject.org. Expat, which is a widely used XML parser written in C, also uses an event-based API. For more information on the Expat parser and its API, see http://www.jclark.com/xml/expat.html.
XML documents, because of the requirements for well-formedness, describe tree structures. Documents typically contain an element that then contains text, attributes, and other elements, and these may contain elements, text, and attributes, and so on. Declarations, comments, and processing instructions enrich the mix, but all basically hold positions in the overall tree.
There are a wide variety of tree models for XML documents. XPath (described in Chapter 9), used in XSLT transformations, has a slightly different set of expectations than does the Document Object Model (DOM) API, which is also different from the XML Information Set (Infoset), another W3C project. XML Schema (described in Chapter 16 and Chapter 21) defines a Post-Schema Validation Infoset (PSVI), which has more information in it (derived from the XML Schema) than any of the others.
Developers who want to manipulate documents from their programs typically use APIs that provide access to an object model representing the XML document. Tree-based APIs typically present a model of an entire document to an application once parsing has successfully concluded. Applications don't have to worry about figuring out context or dealing with rollback when an error is encountered, since the tree model and parsing already address those issues. Rather than following a stream of events, an application can just navigate a tree to find the desired pieces of a document. Browsers and editors can present or modify the tree in conformance with user or script requests, using the tree as a persistent reference to the current content of the document.
Working with a tree model of a document isn't very different conceptually from working with a document as text. The entire document is always available, and moving around well-formed portions of a document or modifying them is fairly easy. The complete set of context for any given part of the document is always available. Developers can use XPath expressions to locate content and make decisions based on content anywhere in the document where APIs support XPath. (DOM Level 3 adds formal support for XPath, and various implementations provide their own support.)
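As a rough illustration of how freely a tree can be navigated and rearranged, the following sketch uses xml.dom.minidom, the lightweight DOM implementation in Python's standard library. The function name reorder_and_annotate and the initials attribute are made up for the example. Because the whole document is in memory, the code can move one element in front of another and compute an attribute from content found in two different places.

```python
from xml.dom import minidom

DOC = "<name><given>Keith</given><family>Johnson</family></name>"


def reorder_and_annotate(xml_text):
    document = minidom.parseString(xml_text)
    root = document.documentElement
    given = root.getElementsByTagName("given")[0]
    family = root.getElementsByTagName("family")[0]

    # Move <family> so that it comes before <given>.
    root.insertBefore(family, given)

    # Build an attribute from text found in two separate elements.
    initials = given.firstChild.data[0] + family.firstChild.data[0]
    root.setAttribute("initials", initials)
    return root.toxml()


print(reorder_and_annotate(DOC))
```

Doing the same thing with events alone would mean buffering both elements until the end of <name>, which amounts to building a small tree by hand.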
Tree models of documents have a few drawbacks. They can take up large chunks of memory, typically multiplying the original document's size. Navigating documents can require additional processing after the parse, as developers have more options available to them. (Tree models don't impose the same kinds of discipline as event-based processing.) Both of these issues can make it difficult to scale and share applications that rely on tree models, though they may still be appropriate where small numbers of documents or small documents are being used.
TIP: The Document Object Model (DOM), described in Chapter 18 and Chapter 24, is the most common tree-based API. JDOM (http://jdom.org/) and DOM4J (http://dom4j.org/) are Java-only alternatives.
Another facility available to the XML programmer is the XML transformation library. Extensible Stylesheet Language Transformations (XSLT), covered in Chapter 8, is the most popular tool currently available for transforming XML to HTML, XML, or any other text-based format that can be expressed in XSLT. In some cases, using a transformation to perform pre- or post-processing on XML data when processing it with either DOM or SAX might be simpler or more efficient. For instance, XSLT could be used as a preprocessor for a screen-scraping application that starts from XHTML documents. A script could extract the meaningful features from the XHTML document and pour them into an application-specific XML format.
Transformations may be used by themselves, in browsers, or at the command line, but many XSLT implementations and other transformation tools offer SAX or DOM interfaces, simplifying the task of using them to build pipelines.
Developers who want to take advantage of XML's cross-platform benefits but have no patience for the details of markup can use various tools that rely on XML but don't require direct exposure to XML's structures. Web Services, mentioned in Chapter 15, can be seen as a move in this direction. You can still touch the XML directly if you need to, but toolkits make it easier to avoid doing so.
These kinds of applications are generally built as a layer on top of event- or tree-based processing, presenting their own API to the underlying information. This level of abstraction may be very useful in some cases or an inefficient inconvenience in others. Understanding more direct connections to XML is probably helpful if you need to evaluate the advantages and disadvantages of an abstraction, or to provide a bridge to systems that don't support a particular abstraction layer but still need access to the information.
The SAX and DOM specifications, along with the various core XML specifications, provide a foundation for XML processing. Implementations of these standards, especially implementations of the DOM, sometimes vary from the specification. Some extensions are themselves formally specified--Scalable Vector Graphics (SVG), for instance, specifies extensions to the DOM that are specific to working with SVG. Others are just kind of tacked on, adding functionality that a programmer or vendor felt was important but wasn't in the original specification. The multiple levels and modules of the DOM have also led to developers claiming support for the DOM, but actually supporting particular subsets (or extensions) of the available specifications.
Porting standards also leads to variations. SAX was developed for Java, and the core SAX project only defines a Java API. The DOM uses Interface Definition Language (IDL) to define its API, but different implementations have interpreted the IDL slightly differently. SAX2 and the DOM are somewhat portable, but moving between environments may require some unlearning and relearning.
Some environments also offer libraries well outside the SAX and DOM interfaces. Perl and Python both offer libraries that combine event and tree processing--for instance, permitting applications to work on partial trees rather than SAX events or full DOM trees. Microsoft .NET's XmlReader offers similarly flexible processing. These approaches do not make moving between environments easy, but they can be very useful.
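In Python, one library of this kind is xml.dom.pulldom in the standard library. The sketch below is only an illustration; the people document and the family_names function are hypothetical. The application pulls events one at a time and asks for a DOM subtree only when it reaches an element it wants to examine as a tree.

```python
from xml.dom import pulldom

# Hypothetical document holding several <name> records.
DOC = ("<people>"
       "<name><given>Keith</given><family>Johnson</family></name>"
       "<name><given>Ada</given><family>Lovelace</family></name>"
       "</people>")


def family_names(xml_text):
    names = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "name":
            # Build a DOM subtree for this one <name> element only.
            events.expandNode(node)
            family = node.getElementsByTagName("family")[0]
            names.append(family.firstChild.data)
    return names


print(family_names(DOC))
```

Each <name> is handled as a small tree of its own, and the code never asks for a tree of the whole document.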
While text, events, trees, and transformations may seem very different, it isn't unusual to combine them. Most parsers that produce DOM trees also offer the option of SAX events, and there are a number of tools that can create DOM trees from SAX events or vice versa. Some tools that accept and generate SAX events actually build internal trees--many XSLT processors operate this way, using optimized internal models for their trees rather than the generic DOM. XSLT processors themselves often accept either SAX events or DOM trees as input and can produce these models (or text) for their output.
Most programmers who want direct access to XML documents start with DOM trees, which are easier to figure out initially. If they have problems that are better solved in event-based environments, they can either rewrite their code for events--it's a big change--or mix and match event processing with tree processing.
Copyright © 2002 O'Reilly & Associates. All rights reserved.