
Perl Cookbook

22.3. Parsing XML into SAX Events

22.3.1. Problem

You want to receive Simple API for XML (SAX) events from an XML parser, because event-based parsing is faster and uses less memory than building a DOM tree.

22.3.2. Solution

Use the XML::SAX module from CPAN:

use XML::SAX::ParserFactory;
use MyHandler;

my $handler = MyHandler->new( );
my $parser = XML::SAX::ParserFactory->parser(Handler => $handler);

$parser->parse_uri($FILENAME);
# or
$parser->parse_string($XML);

Logic for handling events goes into the handler class (MyHandler in this example), which you write:

# in MyHandler.pm
package MyHandler;

use base qw(XML::SAX::Base);

sub start_element {   # method names are specified by SAX
  my ($self, $data) = @_;
  # $data is hash with keys like Name and Attributes
  # ...
}

# other possible methods include end_element( ) and characters( )

1;

22.3.3. Discussion

An XML processor that uses SAX has three parts: the XML parser that generates SAX events, the handler that reacts to them, and the stub that connects the two. The XML parser can be built on XML::Parser (through the XML::SAX::Expat module), XML::LibXML, or the pure-Perl XML::SAX::PurePerl that comes with XML::SAX. The XML::SAX::ParserFactory module selects a parser for you and connects it to your handler. Your handler takes the form of a class that inherits from XML::SAX::Base. The stub is the program shown in the Solution.
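The three parts can be put in a single file to try them out. The following is a minimal sketch rather than a recommended layout: ElementCounter and its _count key are hypothetical names invented for the illustration, and the sample XML is made up. Setting $XML::SAX::ParserPackage asks the factory for a specific parser instead of letting it choose; drop that line to see which parser your installation picks.

```perl
use strict;
use warnings;

# The handler: a hypothetical class that counts how often each element occurs.
package ElementCounter;
use base qw(XML::SAX::Base);

sub start_element {
    my ($self, $data) = @_;
    $self->{_count}{ $data->{Name} }++;   # _count is our own key, not part of SAX
}

# The stub: create the handler, ask the factory for a parser, and parse.
package main;
use XML::SAX::ParserFactory;

# Optional: force the pure-Perl parser rather than letting the factory choose.
local $XML::SAX::ParserPackage = 'XML::SAX::PurePerl';

my $handler = ElementCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

# Made-up sample data.
my $xml = '<books><book><title>One</title></book>'
        . '<book><title>Two</title></book></books>';
$parser->parse_string($xml);

print "Parser class: ", ref($parser), "\n";
for my $name (sort keys %{ $handler->{_count} }) {
    printf "%-6s %d\n", $name, $handler->{_count}{$name};
}
```

Because the parser holds a reference to the handler object, anything the handler records during the parse is still available from $handler afterward.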

The XML::SAX::Base module provides stubs for the different methods that the XML parser calls on your handler. Those methods are listed in Table 22-2, and are the methods defined by the SAX1 and SAX2 standards at http://www.saxproject.org/. The Perl implementation uses more Perl-ish data structures and is described in the XML::SAX::Intro manpage.

Table 22-2. XML::SAX::Base methods

start_document        end_document            characters
start_element         end_element             processing_instruction
ignorable_whitespace  set_document_locator    skipped_entity
start_prefix_mapping  end_prefix_mapping      comment
start_cdata           end_cdata               entity_reference
notation_decl         unparsed_entity_decl    element_decl
attlist_decl          doctype_decl            xml_decl
entity_decl           attribute_decl          internal_entity_decl
start_dtd             end_dtd                 external_entity_decl
resolve_entity        start_entity            end_entity
warning               error                   fatal_error
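Because XML::SAX::Base supplies a stub for every method in the table, a handler overrides only the events it cares about. The sketch below overrides five of them and records the order in which they arrive. EventTracer and its _log key are hypothetical names, the sample XML is made up, and the pure-Perl parser is forced only so that the example does not depend on which parsers are installed.

```perl
use strict;
use warnings;

# Hypothetical handler that records each event it receives, in order.
package EventTracer;
use base qw(XML::SAX::Base);

sub start_document {
    my ($self) = @_;
    $self->{_log} = ['start_document'];
}

sub end_document {
    my ($self) = @_;
    push @{ $self->{_log} }, 'end_document';
}

sub start_element {
    my ($self, $data) = @_;
    push @{ $self->{_log} }, "start_element $data->{Name}";
}

sub end_element {
    my ($self, $data) = @_;
    push @{ $self->{_log} }, "end_element $data->{Name}";
}

sub characters {
    my ($self, $data) = @_;
    # A parser may deliver one run of text in several characters events.
    push @{ $self->{_log} }, "characters '$data->{Data}'";
}

package main;
use XML::SAX::ParserFactory;

local $XML::SAX::ParserPackage = 'XML::SAX::PurePerl';

my $handler = EventTracer->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string('<note><to>gnat</to></note>');

my @log = @{ $handler->{_log} };
print "$_\n" for @log;
```

Events without an override, such as processing_instruction or comment, fall through to the inherited stubs and are ignored by this handler.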

The two data structures you need most often are those representing elements and attributes. The $data parameter to start_element and end_element is a hash reference. The keys of the hash are given in Table 22-3.

Table 22-3. An XML::SAX element hash

Key           Meaning
Prefix        XML namespace prefix (e.g., email)
LocalName     Element name without prefix (e.g., to)
Name          Fully qualified element name (e.g., email:to)
Attributes    Hash of attributes of the element
NamespaceURI  URI of the XML namespace for this element
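To see these keys for a real element, a handler can copy them out of $data as each element starts. This is only a sketch: ElementInspector and its _seen key are hypothetical, and the sample document reuses the email prefix and the http://example.com/dtds/mailspec/ namespace purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical handler that keeps the naming keys of every element it sees.
package ElementInspector;
use base qw(XML::SAX::Base);

my @KEYS = qw(Name Prefix LocalName NamespaceURI);

sub start_element {
    my ($self, $data) = @_;
    my %copy;
    @copy{@KEYS} = @{$data}{@KEYS};      # hash slice: copy just these keys
    push @{ $self->{_seen} }, \%copy;
}

package main;
use XML::SAX::ParserFactory;

local $XML::SAX::ParserPackage = 'XML::SAX::PurePerl';

my $ns  = 'http://example.com/dtds/mailspec/';
my $xml = qq{<email:message xmlns:email="$ns">}
        . q{<email:to>gnat</email:to><subject>hi</subject>}
        . q{</email:message>};

my $handler = ElementInspector->new;
XML::SAX::ParserFactory->parser(Handler => $handler)->parse_string($xml);

my @seen = @{ $handler->{_seen} };
for my $el (@seen) {
    # An element outside any namespace may report empty or undefined values.
    printf "Name=%s Prefix=%s LocalName=%s NamespaceURI=%s\n",
        map { defined $el->{$_} ? $el->{$_} : '' }
            qw(Name Prefix LocalName NamespaceURI);
}
```

The unprefixed subject element is included to show what the keys look like when no namespace applies.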

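An attribute hash has a key for each attribute. The key is structured as "{namespaceURI}attrname". An attribute that is in no namespace (the usual case for an attribute written without a prefix) has an empty namespace part, as in "{}attrname". For example, if the attribute's namespace URI is http://example.com/dtds/mailspec/ and the attribute is msgid, the key in the attribute hash is: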

{http://example.com/dtds/mailspec/}msgid

Each value in the attribute hash is itself a hash reference; its keys are given in Table 22-4.

Table 22-4. An XML::SAX attribute hash

Key           Meaning
Prefix        XML namespace prefix (e.g., email)
LocalName     Attribute name without prefix (e.g., to)
Name          Fully qualified attribute name (e.g., email:to)
Value         Value of the attribute
NamespaceURI  URI of the XML namespace for this attribute
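Looking up an attribute therefore means building the "{namespaceURI}attrname" key and then reading the Value entry of the hash stored under it. The sketch below does that for one namespaced and one plain attribute. AttributeLister, its _values key, and the priority attribute are hypothetical names for the illustration.

```perl
use strict;
use warnings;

# Hypothetical handler that records the value of every attribute it sees,
# indexed by the "{namespaceURI}attrname" key the parser supplies.
package AttributeLister;
use base qw(XML::SAX::Base);

sub start_element {
    my ($self, $data) = @_;
    my $attrs = $data->{Attributes};
    for my $key (keys %$attrs) {
        $self->{_values}{$key} = $attrs->{$key}{Value};
    }
}

package main;
use XML::SAX::ParserFactory;

local $XML::SAX::ParserPackage = 'XML::SAX::PurePerl';

my $ns  = 'http://example.com/dtds/mailspec/';
my $xml = qq{<email:message xmlns:email="$ns" }
        . q{email:msgid="42" priority="high"></email:message>};

my $handler = AttributeLister->new;
XML::SAX::ParserFactory->parser(Handler => $handler)->parse_string($xml);

my $values       = $handler->{_values};
my $msgid_key    = '{' . $ns . '}msgid';   # attribute in the mailspec namespace
my $priority_key = '{}priority';           # attribute in no namespace

print "msgid:    $values->{$msgid_key}\n";
print "priority: $values->{$priority_key}\n";

# List every key the parser reported; a parser may also report the
# xmlns:email declaration itself as an attribute.
print "$_\n" for sort keys %$values;
```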

Example 22-4 shows how to list the book titles using SAX events. It's more complex than the DOM solution because with SAX we must keep track of where we are in the XML document.

Example 22-4. sax-titledumper

# TitleDumper.pm -- SAX handler to display titles in books file
package TitleDumper;

use strict;
use warnings;
use base qw(XML::SAX::Base);

# nesting depth of <title> elements; file-scoped, so shared by all TitleDumper objects
my $in_title = 0;

# if we're entering a title, increase $in_title
sub start_element {
  my ($self, $data) = @_;
  if ($data->{Name} eq 'title') {
    $in_title++;
  }
}

# if we're leaving a title, decrease $in_title and print a newline
sub end_element {
  my ($self, $data) = @_;
  if ($data->{Name} eq 'title') {
    $in_title--;
    print "\n";
  }
}

# if we're in a title, print any text we get
sub characters {
  my ($self, $data) = @_;
  if ($in_title) {
    print $data->{Data};
  }
}

1;
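A handler needs a stub to drive it. The sketch below is a self-contained variation on the example rather than the original program: TitleCollector is a hypothetical handler that keeps its state in the object instead of a file-scoped variable and collects the titles into an array instead of printing them as they arrive, and the sample book data is made up.

```perl
use strict;
use warnings;

# Hypothetical variation on TitleDumper: state lives in the handler object.
package TitleCollector;
use base qw(XML::SAX::Base);

# Reset the state at the start of each document so the handler can be reused.
sub start_document {
    my ($self) = @_;
    $self->{_in_title} = 0;
    $self->{_titles}   = [];
}

# Entering a title: note it and start a new, empty title string.
sub start_element {
    my ($self, $data) = @_;
    if ($data->{Name} eq 'title') {
        $self->{_in_title}++;
        push @{ $self->{_titles} }, '';
    }
}

# Leaving a title.
sub end_element {
    my ($self, $data) = @_;
    $self->{_in_title}-- if $data->{Name} eq 'title';
}

# Text may arrive in several chunks, so append rather than assign.
sub characters {
    my ($self, $data) = @_;
    $self->{_titles}[-1] .= $data->{Data} if $self->{_in_title};
}

package main;
use XML::SAX::ParserFactory;

local $XML::SAX::ParserPackage = 'XML::SAX::PurePerl';

# Made-up sample data.
my $xml = <<'END_XML';
<books>
  <book id="1"><title>First Book</title><edition>3</edition></book>
  <book id="2"><title>Second Book</title><edition>1</edition></book>
</books>
END_XML

my $handler = TitleCollector->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string($xml);

my @titles = @{ $handler->{_titles} };
print "$_\n" for @titles;
```

Keeping the counter in $self means two handler objects in the same program cannot interfere with each other, at the cost of slightly longer code.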

The XML::SAX::Intro manpage provides a gentle introduction to XML::SAX parsing.

22.3.4. See Also

Chapter 5 of Perl & XML; the documentation for the CPAN modules XML::SAX, XML::SAX::Base, and XML::SAX::Intro


Copyright © 2003 O'Reilly & Associates. All rights reserved.