Perl Cookbook

22.2. Parsing XML into a DOM Tree

22.2.1. Problem

You want to use the Document Object Model (DOM) to access and perhaps change the parse tree of an XML file.

22.2.2. Solution

Use the XML::LibXML module from CPAN:

use XML::LibXML;
my $parser = XML::LibXML->new( );
my $dom    = $parser->parse_string($XML);
# or, to read the XML from a file instead:
$dom       = $parser->parse_file($FILENAME);
my $root   = $dom->getDocumentElement;

22.2.3. Discussion

DOM is a framework of classes for representing XML parse trees. Each element is a node in the tree, and you can ask a node for its child nodes (the XML elements in this case), add another child node to it, or move it somewhere else in the tree. The parse_string, parse_file, and parse_fh (filehandle) methods all return a DOM document object that you can use to find nodes in the tree.
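As a minimal sketch of what finding a node's children looks like, the following parses a small inline document and lists the element children of the root. The sample XML and the helper name element_child_names are made up for illustration; the parser keeps the whitespace between tags as text nodes, so the helper filters on node type.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, standing in for the books file.
my $xml = <<'END_XML';
<books>
  <book id="1"><title>Programming Perl</title></book>
  <book id="2"><title>Perl Cookbook</title></book>
</books>
END_XML

my $dom  = XML::LibXML->new->parse_string($xml);
my $root = $dom->documentElement;

# Return the names of a node's element children, skipping the
# whitespace-only text nodes that sit between the tags.
sub element_child_names {
    my ($node) = @_;
    return map  { $_->nodeName }
           grep { $_->nodeType == XML_ELEMENT_NODE } $node->childNodes;
}

print join(", ", element_child_names($root)), "\n";   # prints "book, book"
```

Calling childNodes without the filter would also return the text nodes that hold the line breaks and indentation.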

For example, given the books XML from Example 22-1, Example 22-2 shows one way to print the titles.

Example 22-2. dom-titledumper

#!/usr/bin/perl -w
# dom-titledumper -- display titles in books file using DOM

use strict;
use XML::LibXML;

my $parser = XML::LibXML->new;
my $dom = $parser->parse_file("books.xml") or die;

# get all the title elements
my @titles = $dom->getElementsByTagName("title");
foreach my $t (@titles) {
    # get the text node inside the <title> element, and print its value
    print $t->firstChild->data, "\n";
}

The getElementsByTagName method returns a list of the element nodes in the document that have the given tag name. Here we get a list of the title elements, then go through each title to find its contents. We know that each title has only a single piece of text, so we assume the first child node is text and print its contents.
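When that assumption might not hold, one alternative is the textContent method, which returns the concatenated text of everything inside an element. The sketch below uses a hypothetical input in which one title contains inline markup and another is empty, two cases where calling firstChild->data would fail. The helper name title_strings is invented for this example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: the second title starts with an element rather
# than text, and the third title has no children at all.
my $xml = <<'END_XML';
<books>
  <book><title>Perl Cookbook</title></book>
  <book><title><em>Programming</em> Perl</title></book>
  <book><title/></book>
</books>
END_XML

my $dom = XML::LibXML->new->parse_string($xml);

# Collect the text of every <title>, whatever its child structure.
sub title_strings {
    my ($doc) = @_;
    return map { $_->textContent } $doc->getElementsByTagName("title");
}

print "[$_]\n" for title_strings($dom);
```

The brackets in the output make the empty title visible.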

If we wanted to confirm that the node was a text node, we could say:

die "the title contained something other than text!"
  if $t->firstChild->nodeType != 3;

This ensures that the first node is of type 3 (text). Table 22-1 shows LibXML's numeric node types, which the nodeType method returns.

Table 22-1. LibXML's numeric node types

Node type                 Number
Element                   1
Attribute                 2
Text                      3
CDATA Section             4
Entity Ref                5
Entity                    6
Processing Instruction    7
Comment                   8
Document                  9
Document Type             10
Document Fragment         11
Notation                  12
HTML Document             13
DTD Node                  14
Element Decl              15
Attribute Decl            16
Entity Decl               17
Namespace Decl            18
XInclude Start            19
XInclude End              20
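Rather than hardcoding these numbers, you can compare against the named constants that XML::LibXML exports, such as XML_TEXT_NODE and XML_ELEMENT_NODE. The following is a small sketch of the text-node check written that way. The helper name starts_with_text and the inline sample document are assumptions made for this example.

```perl
use strict;
use warnings;
use XML::LibXML;   # also exports XML_TEXT_NODE, XML_ELEMENT_NODE, and so on

# Hypothetical helper: true if the element has a first child
# and that child is a text node (type 3 in the table above).
sub starts_with_text {
    my ($element) = @_;
    my $first = $element->firstChild;
    return defined $first && $first->nodeType == XML_TEXT_NODE;
}

my $dom = XML::LibXML->new->parse_string(
    '<books><title>Perl Cookbook</title><title><!-- missing --></title></books>'
);

foreach my $t ($dom->getElementsByTagName("title")) {
    print starts_with_text($t) ? "text\n" : "not text\n";
}
```

Testing that firstChild is defined first also covers an empty element, which has no children.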

You can also create and insert new nodes, or move and delete existing ones, to change the parse tree. Example 22-3 shows how you would add a randomly generated price value to each book element.

Example 22-3. dom-addprice

#!/usr/bin/perl -w
# dom-addprice -- add price element to books

use strict;
use XML::LibXML;

my $parser = XML::LibXML->new;
my $dom = $parser->parse_file("books.xml") or die;
my $root = $dom->documentElement;

# get list of all the "book" elements
my @books = $root->getElementsByTagName("book");

foreach my $book (@books) {
  my $price = sprintf("\$%d.95", 19 + 5 * int rand 5); # random price
  my $price_text_node = $dom->createTextNode($price);  # contents of <price>
  my $price_element   = $dom->createElement("price");  # create <price>
  $price_element->appendChild($price_text_node);       # put contents into <price>
  $book->appendChild($price_element);                  # put <price> into <book>
}

print $dom->toString;

We use createTextNode and createElement to build the new price element and its contents. Then we use appendChild to add that element to the end of the current book element's existing contents. The toString method emits the document as XML, which makes it easy to write XML filters like this one using DOM.
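Deleting and moving nodes follow the same pattern. The sketch below, which runs on a made-up two-book document rather than the book's sample file, removes every price element with removeChild and then moves the last book to the front with insertBefore.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input; single quotes keep the dollar sign literal.
my $xml = '<books><book id="a"><title>A</title><price>$19.95</price></book>'
        . '<book id="b"><title>B</title></book></books>';

my $dom  = XML::LibXML->new->parse_string($xml);
my $root = $dom->documentElement;

# Delete every <price>: removeChild detaches the node from its parent.
foreach my $price ($root->getElementsByTagName("price")) {
    $price->parentNode->removeChild($price);
}

# Move the last <book> to the front: detach it, then reinsert it
# before the root's current first child.
my @books = $root->getChildrenByTagName("book");
my $last  = $root->removeChild($books[-1]);
$root->insertBefore($last, $root->firstChild);

print $dom->toString;
```

Detaching the node explicitly before reinserting it keeps the two steps visible; removeChild returns the detached node, so it can be passed straight to insertBefore.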

The XML::LibXML::DOM manpage gives a quick introduction to the features of XML::LibXML's DOM support and references the manpages for the DOM classes (e.g., XML::LibXML::Node). Those manpages list the methods for the objects.

22.2.4. See Also

The documentation for the XML::LibXML::DOM, XML::LibXML::Document, XML::LibXML::Element, and XML::LibXML::Node modules



Copyright © 2003 O'Reilly & Associates. All rights reserved.