Parsing XML into a DOM Tree (Perl Cookbook, 2nd Edition)

22.2.1. Problem

You want to use the Document Object Model (DOM) to access and perhaps change the parse tree of an XML file.

22.2.2. Solution

use XML::LibXML;
my $parser = XML::LibXML->new( );
my $dom    = $parser->parse_string($XML);
# or
my $dom    = $parser->parse_file($FILENAME);
my $root   = $dom->getDocumentElement;

22.2.3. Discussion

DOM is a framework of classes for representing XML parse trees. Each element is a node in the tree, and you can perform operations on it such as finding its child nodes (the XML elements, in this case), adding another child node, or moving the node elsewhere in the tree. The parse_string, parse_file, and parse_fh (filehandle) methods all return a DOM document object that you can use to find nodes in the tree.
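
As a rough illustration of walking the tree, the sketch below parses a small inline document and visits the children of the root element. The XML here is a hypothetical stand-in, not the books file from Example 22-1, and the variable names are chosen only for this example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the books file; not the book's Example 22-1.
my $xml = <<'END_XML';
<books>
  <book id="1"><title>Programming Perl</title></book>
  <book id="2"><title>Perl Cookbook</title></book>
</books>
END_XML

my $dom  = XML::LibXML->new->parse_string($xml);
my $root = $dom->documentElement;

# childNodes returns every child, including the whitespace-only text
# nodes between elements, so keep only element nodes (nodeType 1).
my @books = grep { $_->nodeType == 1 } $root->childNodes;

foreach my $book (@books) {
    # each <book> knows its own name, its attributes, and its parent
    printf "%s id=%s (child of %s)\n",
        $book->nodeName, $book->getAttribute("id"), $book->parentNode->nodeName;
}
```

Filtering on nodeType matters here because the newlines and indentation between the book elements are themselves text nodes in the tree.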

For example, given the books XML from Example 22-1, Example 22-2 shows one way to print the titles.

Example 22-2. dom-titledumper

#!/usr/bin/perl -w
# dom-titledumper -- display titles in books file using DOM

use XML::LibXML;
use strict;

my $parser = XML::LibXML->new;
my $dom = $parser->parse_file("books.xml") or die;

# get all the title elements
my @titles = $dom->getElementsByTagName("title");
foreach my $t (@titles) {
    # get the text node inside the <title> element, and print its value
    print $t->firstChild->data, "\n";
}

The getElementsByTagName method returns a list of the element nodes in the document that have the given tag name. Here we get a list of the title elements, then visit each one to find its contents. We know that each title holds only a single piece of text, so we assume the first child node is a text node and print its contents.
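
When that assumption might not hold, one alternative is the textContent method, which gathers the text of every descendant rather than only the first child. The following is a minimal sketch using a hypothetical title with mixed content (it is not taken from the books file):

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical title with mixed content: text, an <i> element, more text.
my $xml = '<book><title>Learning <i>Perl</i>, 3rd Edition</title></book>';
my $dom = XML::LibXML->new->parse_string($xml);

# in list context getElementsByTagName returns a list; take the first match
my ($title) = $dom->getElementsByTagName("title");

# firstChild is only the leading text node
my $first = $title->firstChild->data;

# textContent concatenates the text of every descendant node
my $whole = $title->textContent;

print "first child: '$first'\n";
print "textContent: '$whole'\n";
```

Here firstChild sees only the text before the nested element, while textContent returns the full title.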

If we wanted to confirm that the node was a text node, we could say:

die "the title contained something other than text!"
  if $t->firstChild->nodeType != 3;

This ensures that the first node is of type 3 (text). Table 22-1 shows LibXML's numeric node types, which the nodeType method returns.

Table 22-1. LibXML's numeric node types

Node type	Number
Element	1
Attribute	2
Text	3
CDATA Section	4
Entity Ref	5
Entity	6
Processing Instruction	7
Comment	8
Document	9
Document Type	10
Document Fragment	11
Notation	12
HTML Document	13
DTD Node	14
Element Decl	15
Attribute Decl	16
Entity Decl	17
Namespace Decl	18
XInclude Start	19
XInclude End	20
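
Rather than comparing against bare numbers, you can use the named constants that current releases of XML::LibXML export by default, such as XML_TEXT_NODE and XML_COMMENT_NODE. The sketch below assumes such a release and uses a hypothetical title element whose first child is a comment:

```perl
use strict;
use warnings;
use XML::LibXML;   # assumed to export XML_TEXT_NODE, XML_COMMENT_NODE, etc.

# Hypothetical element whose first child is a comment, not text.
my $xml   = '<title><!-- draft -->Perl Cookbook</title>';
my $dom   = XML::LibXML->new->parse_string($xml);
my $title = $dom->documentElement;

my $first = $title->firstChild;
if ($first->nodeType == XML_TEXT_NODE) {
    print "text: ", $first->data, "\n";
}
elsif ($first->nodeType == XML_COMMENT_NODE) {
    print "a comment comes first, so firstChild is not the title text\n";
}

# collect the contents of the text children only, skipping the comment
my @text = map  { $_->data }
           grep { $_->nodeType == XML_TEXT_NODE } $title->childNodes;
print "text children: @text\n";
```

The constants correspond to the numbers in Table 22-1, so XML_TEXT_NODE is 3 and XML_COMMENT_NODE is 8.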

You can also create and insert new nodes, or move and delete existing ones, to change the parse tree. Example 22-3 shows how you would add a randomly generated price value to each book element.

Example 22-3. dom-addprice

#!/usr/bin/perl -w
# dom-addprice -- add price element to books

use XML::LibXML;
use strict;

my $parser = XML::LibXML->new;
my $dom = $parser->parse_file("books.xml") or die;
my $root = $dom->documentElement;

# get list of all the "book" elements
my @books = $root->getElementsByTagName("book");

foreach my $book (@books) {
  my $price = sprintf("\$%d.95", 19 + 5 * int rand 5); # random price
  my $price_text_node = $dom->createTextNode($price);  # contents of <price>
  my $price_element   = $dom->createElement("price");  # create <price>
  $price_element->appendChild($price_text_node);       # put contents into <price>
  $book->appendChild($price_element);                  # put <price> into <book>
}

print $dom->toString;

We use createTextNode and createElement to build the new price element and its contents. Then we use appendChild to add the element to the end of the current book element's existing contents. The toString method emits a document as XML, which makes it easy to write XML filters like this one using DOM.
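
Moving and deleting nodes follows the same pattern. As a sketch only, the code below uses removeChild and insertBefore on a hypothetical inline document (a stand-in for the output of dom-addprice, with made-up prices) to move each price in front of its title and then drop the second book:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input resembling the output of dom-addprice.
my $xml = <<'END_XML';
<books>
  <book id="1"><title>Programming Perl</title><price>$39.95</price></book>
  <book id="2"><title>Perl Cookbook</title><price>$24.95</price></book>
</books>
END_XML

my $dom  = XML::LibXML->new->parse_string($xml);
my $root = $dom->documentElement;

foreach my $book ($root->getElementsByTagName("book")) {
    my ($title) = $book->getElementsByTagName("title");
    my ($price) = $book->getElementsByTagName("price");

    # move: detach <price> from <book>, then reinsert it before <title>
    $book->removeChild($price);
    $book->insertBefore($price, $title);
}

# delete: remove the second <book> from the tree entirely
my @books = $root->getElementsByTagName("book");
$root->removeChild($books[1]);

print $dom->toString;
```

Detaching the node with removeChild before reinserting it keeps the move explicit; the detached node remains usable as long as a variable still refers to it.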

The XML::LibXML::DOM manpage gives a quick introduction to the features of XML::LibXML's DOM support and references the manpages for the DOM classes (e.g., XML::LibXML::Node). Those manpages list the methods for the objects.

22.2.4. See Also

The documentation for the XML::LibXML::DOM, XML::LibXML::Document, XML::LibXML::Element, and XML::LibXML::Node modules

