If you've done any web hacking with Perl before, then you've kinda-sorta used XML, since HTML isn't too far off from the well-formedness goals of XML, at least in theory. In practice, HTML is used more frequently as a combination of markup, punctuation, embedded scripts, and a dozen other things that make web pages act nutty (with most popular web browsers being rather forgiving about syntax).
Currently, and probably for a long time to come, the language of the Web remains HTML. While you can use bona fide XML in your web pages by clinging to the W3C's XHTML,[40] it's far more likely that you'll need to turn it into HTML when you want to apply your XML to the Web.
[40]XHTML comes in two flavors. We prefer the less pendantic "transitional" flavor, which chooses to look the other way when one commits egregious sins (such as using the <font> tag instead of the preferred method of applying cascading stylesheets).
You can go about this in many ways. The most sledgehammery of these involves parsing your document and tossing out the results in a CGI script. This example reads a local MonkeyML file of my pet monkeys' names, and prints a web page to standard output (using Lincoln Stein's ubiquitous CGI module to add a bit of syntactic sugar):
#!/usr/bin/perl use warnings; use strict; use CGI qw(:standard); use XML::LibXML; my $parser = XML::XPath; my $doc = $parser->parse_file('monkeys.xml'); print header; print start_html("My Pet Monkeys"); print h1("My Pet Monkeys"); print p("I have the following monkeys in my house:"); print "<ul>\n"; foreach my $name_node ($doc->documentElement->findnodes("//mm:name")) { print "<li>" . $name_node->firstChild->getData ."</li>\n"; } print end_html;
Another approach involves XSLT.
XSLT is used to translate one type of XML into another. XSLT factors in strongly here because using XML and the Web often requires that you extract all the presentable pieces of information from an XML document and wrap them up in HTML. One very high-level XML-using application, Matt Sergeant's AxKit (http://www.axkit.org), bases an entire application server framework around this notion, letting you set up a web site that uses XML as its source files, but whose final output to web browsers is HTML (and whose final output to other devices is whatever format best applies to them).
Let's make a little module that converts DocBook files into HTML on the fly. Though our goals are not as ambitious as AxKit's, we'll still take a cue from that program by basing our code around the Apache mod_perl module. mod_perl drops a Perl interpreter inside the Apache web server, and thus allows one to write Perl code that makes all sorts of interesting things happen with requests to the server.
We'll use a couple of mod_perl's basic features here by writing a Perl module with a handler subroutine, the standard name for mod_perl callbacks; it will be passed an object representing the Apache request, and from this object, we'll determine what (if anything) the user sees.
WARNING: A frequent source of frustration for people running Perl and XML programs in an Apache environment comes from Apache itself, or at least the way it behaves if it's not given a few extra configuration directives when compiled. The standard Apache distribution comes with a version of the Expat C libraries, which it will bake into its binary if not explicitly told otherwise. Unfortunately, these libraries often conflict with XML::Parser's calls to Expat libraries elsewhere on the system, resulting in nasty errors (such as segmentation faults on Unix) when they collide.The Apache development community has reportedly considered quietly removing this feature in future releases, but currently, it may be necessary for Perl hackers wishing to invoke Expat (usually by way of XML::Parser) to recompile Apache without it (by setting the EXPAT configuration option to no).
The cheaper workaround involves using a low-level parsing module that doesn't use Expat, such as XML::LibXML or members of the newer XML::SAX family.
We begin by doing the "starting to type in the module" dance, and then digging into that callback sub:
package Apache::DocBook; use warnings; use strict; use Apache::Constants qw(:common); use XML::LibXML; use XML::LibXSLT; our $xml_path; # Document source directory our $base_path; # HTML output directory our $xslt_file; # path to DocBook-to-HTML XSLT stylesheet our $icon_dir; # path to icons used in index pages sub handler { my $r = shift; # Apache request object # Get config info from Apache config $xml_path = $r->dir_config('doc_dir') or die "doc_dir variable not set.\n"; $base_path = $r->dir_config('html_dir') or die "html_dir variable not set.\n"; $icon_dir = $r->dir_config('icon_dir') or die "icon_dir variable not set.\n"; unless (-d $xml_path) { $r->log_reason("Can't use an xml_path of $xml_path: $!", $r->filename); die; } my $filename = $r->filename; $filename =~ s/$base_path\/?//; # Add in path info (the file might not actually exist... YET) $filename .= $r->path_info; $xslt_file = $r->dir_config('xslt_file') or die "xslt_file Apache variable not set.\n"; # The subroutines we'll call after this will take care of printing # stuff at the client. # Is this an index request? if ( (-d "$xml_path/$filename") or ($filename =~ /index.html?$/) ) { # Why yes! We whip up an index page from the floating aethers. my ($dir) = $filename =~ /^(.*)(\/index.html?)?$/; # Semi-hack: stick trailing slash on URI, maybe. if (not($2) and $r->uri !~ /\/$/) { $r->uri($r->uri . '/'); } make_index_page($r, $dir); return $r->status; } else { # No, it's a request for some other page. make_doc_page($r, $filename); return $r->status; } return $r->status; }
This subroutine performs the actual XSLT transformation, given a filename of the original XML source and another filename to which it should write the transformed HTML output:
sub transform { my ($filename, $html_filename) = @_; # make sure there's a home for this file. maybe_mkdir($filename); my $parser = XML::LibXML->new; my $xslt = XML::LibXSLT->new; # Because libxslt seems a little broken, we have to chdir to the # XSLT file's directory, else its file includes won't work. ;b use Cwd; # so we can get the current working dir my $original_dir = cwd; my $xslt_dir = $xslt_file; $xslt_dir =~ s/^(.*)\/.*$/$1/; chdir($xslt_dir) or die "Can't chdir to $xslt_dir: $!"; my $source = $parser->parse_file("$xml_path/$filename"); my $style_doc = $parser->parse_file($xslt_file); my $stylesheet = $xslt->parse_stylesheet($style_doc); my $results = $stylesheet->transform($source); open (HTML_OUT, ">$base_path/$html_filename"); print HTML_OUT $stylesheet->output_string($results); close (HTML_OUT); # Go back to original dir chdir($original_dir) or die "Can't chdir to $original_dir: $!"; }
Now we have a pair of subroutines to generate index pages. Unlike the document pages, which are the product of an XSLT transformation, we make the index pages from scratch, the bulk of its content being a table filled with information we grab from the document via XPath (looking first in the appropriate metadata element if present, and falling back to other bits of information if not).
sub make_index_page { my ($r, $dir) = @_; # If there's no corresponding dir in the XML source, the request # goes splat my $xml_dir = "$xml_path/$dir"; unless (-r $xml_dir) { unless (-d $xml_dir) { # Whoops, this ain't a directory. $r->status( NOT_FOUND ); return; } # It's a directory, but we can't read it. Whatever. $r->status( FORBIDDEN ); return; } # Fetch mtimes of this dir and the index.html in the corresponding # html dir my $index_file = "$base_path/$dir/index.html"; my $xml_mtime = (stat($xml_dir))[9]; my $html_mtime = (stat($index_file))[9]; # If the index page is older than the XML dir, or if it simply # doesn't exist, we generate a new one. if ((not($html_mtime)) or ($html_mtime <= $xml_mtime)) { generate_index($xml_dir, "$base_path/$dir", $r->uri); $r->filename($index_file); send_page($r, $index_file); return; } else { # The cached index page is fine. Let Apache serve it. $r->filename($index_file); $r->path_info(''); send_page($r, $index_file); return; } } sub generate_index { my ($xml_dir, $html_dir, $base_dir) = @_; # Snip possible trailing / from base_dir $base_dir =~ s|/$||; my $index_file = "$html_dir/index.html"; my $local_dir; if ($html_dir =~ /^$base_path\/*(.*)/) { $local_dir = $1; } # make directories, if necessary maybe_mkdir($local_dir); open (INDEX, ">$index_file") or die "Can't write to $index_file: $!"; opendir(DIR, $xml_dir) or die "Couldn't open directory $xml_dir: $!"; chdir($xml_dir) or die "Couldn't chdir to $xml_dir: $!"; # Set icon files my $doc_icon = "$icon_dir/generic.gif"; my $dir_icon = "$icon_dir/folder.gif"; # Make the displayable name of $local_dir (probably the same) my $local_dir_label = $local_dir || 'document root'; # Print start of page print INDEX <<END; <html> <head><title>Index of $local_dir_label</title></head> <body> <h1>Index of $local_dir_label</h1> <table width="100%"> END # Now print one row per file in this dir while (my $file = readdir(DIR)) { # ignore dotfiles & directories & stuff if (-f $file && $file !~ /^\./) { # Create parser objects my $parser = XML::LibXML->new; # Check for well-formedness, skip if yukky: eval {$parser->parse_file($file);}; if ($@) { warn "Blecch, not a well-formed XML file."; warn "Error was: $@"; next; } my $doc = $parser->parse_file($file); my %info; # Will hold presentable info # Determine root type my $root = $doc->documentElement; my $root_type = $root->getName; # Now try to get an appropriate info node, which is named $FOOinfo my ($info) = $root->findnodes("${root_type}info"); if ($info) { # Yay, an info element for us. Fill it %info with it. if (my ($abstract) = $info->findnodes('abstract')) { $info{abstract} = $abstract->string_value; } elsif ($root_type eq 'reference') { # We can usee first refpurpose as our abstract instead. if ( ($abstract) = $root->findnodes('/reference/refentry/refnamediv/refpurpose')) { $info{abstract} = $abstract->string_value; } } if (my ($date) = $info->findnodes('date')) { $info{date} = $date->string_value; } } if (my ($title) = $root->findnodes('title')) { $info{title} = $title->string_value; } # Fill in %info stuff we don't need the XML for... unless ($info{date}) { my $mtime = (stat($file))[9]; $info{date} = localtime($mtime); } $info{title} ||= $file; # That's enough info. Let's build an HTML table row with it. print INDEX "<tr>\n"; # Figure out a filename to link to -- foo.html my $html_file = $file; $html_file =~ s/^(.*)\..*$/$1.html/; print INDEX "<td>"; print INDEX "<img src=\"$doc_icon\">" if $doc_icon; print INDEX "<a href=\"$base_dir/$html_file\">$info{title}</a></td> "; foreach (qw(abstract date)) { print INDEX "<td>$info{$_}</td> " if $info{$_}; } print INDEX "\n</tr>\n"; } elsif (-d $file) { # Just make a directory link. # ...unless it's an ignorable directory. next if grep (/^$file$/, qw(RCS CVS .)) or ($file eq '..' and not $local_dir); print INDEX "<tr>\n<td>"; print INDEX "<a href=\"$base_dir/$file\"><img src=\"$dir_icon\">" if $dir_icon; print INDEX "$file</a></td>\n</tr>\n"; } } # Close the table and end the page print INDEX <<END; </table> </body> </html> END close(INDEX) or die "Can't close $index_file: $!"; closedir(DIR) or die "Can't close $xml_dir: $!"; }
These subroutines build on the transformation subroutine by generating the pages. Note the use of caching by comparing the timestamps of the source DocBook file and the destination HTML file and rewriting the latter only if it's older than the former. (Of course, if there's no HTML at all, then it always creates a new web page.)
sub make_doc_page { my ($r, $html_filename) = @_; # Generate a source filename by replacing existing extension with .xml my $xml_filename = $html_filename; $xml_filename =~ s/^(.*)(?:\..*)$/$1.xml/; # If there's a problem reading the source XML file, so it goes with # the result HTML. unless (-r "$xml_path/$xml_filename") { unless (-e "$xml_path/$xml_filename") { $r->status( NOT_FOUND ); return; } else { # Exists, but no read permissions, shrug. $r->status( FORBIDDEN ); return; } } # Fetch mtimes of this file and corresponding html file my $xml_mtime = (stat("$xml_path/$xml_filename"))[9]; my $html_mtime = (stat("$base_path/$html_filename"))[9]; # If the html file is older than the xml XML file, or if it simply # doesn't exist, generate a new one if ((not($html_mtime)) or ($html_mtime <= $xml_mtime)) { transform($xml_filename, $html_filename); $r->filename("$base_path/$html_filename"); $r->status( DECLINED ); return; } else { # It's cached. Let Apache serve up the existing file. $r->status( DECLINED ); } } sub send_page { my ($r, $html_filename) = @_; # Problem here: if we're creating the file, we can't just write it # and say 'DECLINED', cuz the default server handle hits us with a # file-not-found. Until I find a better solution, I'll just spew # the file, and DECLINE only known cache-hits. $r->status( OK ); $r->send_http_header('text/html'); open(HTML, "$html_filename") or die "Couldn't read $base_path/$html_filename: $!"; while (<HTML>) { $r->print($_); } close(HTML) or die "Couldn't close $html_filename: $!"; return; }
Finally, we have a utility subroutine to help us with a tedious task: copying the structure of subdirectories in the cache directory that mirrors that of the source XML directory:
sub maybe_mkdir { # Given a path, make sure directories leading to it exist, mkdir-ing # any that dont. my ($filename) = @_; my @path_parts = split(/\//, $filename); # if the last one is a filename, toss it out. pop(@path_parts) if -f $filename; my $traversed_path = $base_path; foreach (@path_parts) { $traversed_path .= "/$_"; unless (-d $traversed_path) { mkdir ($traversed_path) or die "Can't mkdir $traversed_path: $!"; } } return 1; }
Copyright © 2002 O'Reilly & Associates. All rights reserved.