XML::Parser::PerlSAX supports another group of handlers used to process DTD events. These handlers take care of anything that appears before the root element, such as the XML declaration, the doctype declaration, and the internal subset of entity and element declarations, which are collectively called the document prolog. If you want to output the document literally as you read it (e.g., in a filter program), you need to define some of these handlers to reproduce the document prolog. Defining these handlers is just what we needed in the previous example.
You can use these handlers for other purposes. For example, you may need to pre-load entity definitions for special processing rather than rely on the parser to do its default substitution for you. These handlers are listed in Table 5-2.
| Method name | Event | Properties |
|---|---|---|
| entity_decl | The parser sees an entity declaration (internal or external, parsed or unparsed). | Name, Value, PublicId, SystemId, Notation |
| notation_decl | The parser found a notation declaration. | Name, PublicId, SystemId, Base |
| unparsed_entity_decl | The parser found a declaration for an unparsed entity (e.g., a binary data entity). | Name, PublicId, SystemId, Base |
| element_decl | An element declaration was found. | Name, Model |
| attlist_decl | An element's attribute list declaration was encountered. | ElementName, AttributeName, Type, Fixed |
| doctype_decl | The parser found the document type declaration. | Name, SystemId, PublicId, Internal |
| xml_decl | The XML declaration was encountered. | Version, Encoding, Standalone |
The entity_decl( ) handler is called for all kinds of entity declarations unless a more specific handler is defined. Thus, unparsed entity declarations trigger the entity_decl( ) handler unless you've defined an unparsed_entity_decl( ), which will take precedence.
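The fallback rule can be pictured with a small simulation. This is only a sketch of the behavior described above, not the parser's actual dispatch code: the `GeneralOnly` and `WithSpecific` classes and the `fire_unparsed_entity` function are hypothetical stand-ins invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical handler class that defines only the general handler.
{
    package GeneralOnly;
    sub new { return bless { log => [] }, shift }
    sub entity_decl {
        my( $self, $properties ) = @_;
        push @{ $self->{log} }, "entity_decl: $properties->{Name}";
    }
}

# Hypothetical subclass that adds the more specific handler.
{
    package WithSpecific;
    our @ISA = ('GeneralOnly');
    sub unparsed_entity_decl {
        my( $self, $properties ) = @_;
        push @{ $self->{log} }, "unparsed_entity_decl: $properties->{Name}";
    }
}

# Stand-in for the parser: prefer the specific handler, else fall back.
sub fire_unparsed_entity {
    my( $handler, $properties ) = @_;
    my $method = $handler->can('unparsed_entity_decl')
              || $handler->can('entity_decl')
        or return;
    $handler->$method( $properties );
}

my %logo     = ( Name => 'logo', SystemId => 'logo.png', Notation => 'png' );
my $general  = GeneralOnly->new;
my $specific = WithSpecific->new;
fire_unparsed_entity( $_, \%logo ) for $general, $specific;

print "$_\n" for @{ $general->{log} }, @{ $specific->{log} };
# prints:
#   entity_decl: logo
#   unparsed_entity_decl: logo
```

The same unparsed entity reaches entity_decl( ) in the first object and unparsed_entity_decl( ) in the second, which is the precedence the text describes.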
entity_decl( )'s parameters vary depending on the entity type. The Value parameter is set for internal entities, but not external ones. Likewise, PublicId and SystemId, parameters that tell an XML processor where to find the file containing the entity's value, are not set for internal entities, only external ones. Base tells the processor what to use for a base URL if the SystemId contains a relative location.
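One way to use this difference is to pre-load entity definitions into a lookup table, as suggested earlier. The following is a minimal sketch that assumes the properties arrive as described above; the `%entity_table` hash and its `kind`, `text`, `system`, and `public` keys are hypothetical names, and the handler is called by hand here rather than by a parser.

```perl
use strict;
use warnings;

my %entity_table;   # hypothetical store: entity name => description

sub entity_decl {
    my( $self, $properties ) = @_;
    my $name = $properties->{'Name'};
    if( defined $properties->{'Value'} ) {
        # internal entity: the replacement text is right here
        $entity_table{ $name } = {
            kind => 'internal',
            text => $properties->{'Value'},
        };
    } else {
        # external entity: only the identifiers are known
        $entity_table{ $name } = {
            kind   => 'external',
            system => $properties->{'SystemId'},
            public => $properties->{'PublicId'},
        };
    }
}

# Simulate the two kinds of events instead of running a parser.
entity_decl( undef, { Name => 'thingy', Value => 'hoo hah blah blah' } );
entity_decl( undef, { Name => 'chap1',  SystemId => 'chap1.xml' } );

for my $name ( sort keys %entity_table ) {
    print "$name is $entity_table{$name}{kind}\n";
}
# prints:
#   chap1 is external
#   thingy is internal
```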
Notation declarations are a special feature of DTDs that allow you to assign a special type identifier to an entity. For example, you could declare an entity to be of type "date" to tell the XML processor that the entity should be treated as that kind of data. Notations aren't used very often in XML, so we won't go into them further.
The Model property of the element_decl( ) handler contains the content model, or grammar, for an element. This property describes what is allowed to go inside an element according to the DTD.
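If a filter needs to reproduce element declarations too, a handler along these lines might do. It is a sketch under one important assumption: that Model arrives as text, or as something that interpolates to the content model text. The `$prolog` buffer and this `output( )` function are hypothetical stand-ins for whatever the filter uses to emit text.

```perl
use strict;
use warnings;

my $prolog = '';                        # hypothetical output buffer
sub output { $prolog .= join '', @_ }   # stand-in for the filter's output()

sub element_decl {
    my( $self, $properties ) = @_;
    my $name  = $properties->{'Name'};
    my $model = $properties->{'Model'};   # assumed to interpolate as text
    output( "<!ELEMENT $name $model>\n" );
}

# Simulate one event by hand rather than running a parser.
element_decl( undef, { Name => 'para', Model => '(#PCDATA|literal)*' } );
print $prolog;
# prints: <!ELEMENT para (#PCDATA|literal)*>
```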
An attribute list declaration in a DTD can contain more than one attribute description. Fortunately, the parser breaks these descriptions up into individual calls to the attlist_decl( ) handler for each attribute.
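Because each attribute arrives in its own call, a handler can regroup the attributes by element if it needs the whole list. The sketch below assumes only the ElementName, AttributeName, and Type properties from Table 5-2; the `%attributes_for` hash is a hypothetical name, and the two calls at the bottom imitate what a declaration with two attributes would trigger.

```perl
use strict;
use warnings;

my %attributes_for;   # hypothetical store: element name => list of attributes

sub attlist_decl {
    my( $self, $properties ) = @_;
    my $element   = $properties->{'ElementName'};
    my $attribute = $properties->{'AttributeName'};
    my $type      = $properties->{'Type'};
    push @{ $attributes_for{ $element } }, "$attribute ($type)";
}

# One declaration with two attribute descriptions, such as
#   <!ATTLIST book id ID #REQUIRED lang CDATA #IMPLIED>
# would, per the text, arrive as two separate calls:
attlist_decl( undef, { ElementName => 'book', AttributeName => 'id',   Type => 'ID' } );
attlist_decl( undef, { ElementName => 'book', AttributeName => 'lang', Type => 'CDATA' } );

print "book: ", join( ', ', @{ $attributes_for{'book'} } ), "\n";
# prints: book: id (ID), lang (CDATA)
```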
The document type declaration is an optional part of the document at the top, just under the XML declaration. The parameter Name is the name of the root element in your document. PublicId and SystemId tell the processor where to find the external DTD. Finally, the Internal parameter contains the whole internal subset as a string, in case you want to skip the individual entity and element declaration handling.
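As a sketch of that shortcut, a doctype handler could write the internal subset out in one piece. This assumes, as stated above, that Internal carries the subset text itself (without the surrounding brackets); if your parser version supplies only a true/false flag there, this approach won't work. The `$out` buffer and this `output( )` function are hypothetical stand-ins for the filter's own.

```perl
use strict;
use warnings;

my $out = '';                        # hypothetical output buffer
sub output { $out .= join '', @_ }   # stand-in for the filter's output()

sub doctype_decl {
    my( $self, $properties ) = @_;
    output( "<!DOCTYPE $properties->{'Name'}" );

    my( $pubid, $sysid ) = @{ $properties }{ 'PublicId', 'SystemId' };
    if( $pubid )    { output( " PUBLIC \"$pubid\" \"$sysid\"" ) }
    elsif( $sysid ) { output( " SYSTEM \"$sysid\"" ) }

    # Assumption: Internal is the subset text, copied through untouched.
    my $intset = $properties->{'Internal'};
    output( " [$intset]" ) if defined $intset and length $intset;
    output( ">\n" );
}

# Simulate the event by hand rather than running a parser.
doctype_decl( undef, {
    Name     => 'book',
    SystemId => '/usr/local/prod/sgml/db.dtd',
    Internal => "\n<!ENTITY thingy \"hoo hah blah blah\">\n",
} );
print $out;
```

If you take this route, leave the individual declaration handlers undefined (or silent), or each declaration will be written twice.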
As an example, let's say you wanted to extend the filter example to output the document prolog exactly as the parser encountered it. You'd need to define handlers like those in Example 5-4.
    #
    # handle xml declaration
    #
    sub xml_decl {
        my( $self, $properties ) = @_;
        output( "<?xml version=\"" . $properties->{'Version'} . "\"" );
        my $encoding = $properties->{'Encoding'};
        output( " encoding=\"$encoding\"" ) if( $encoding );
        my $standalone = $properties->{'Standalone'};
        output( " standalone=\"$standalone\"" ) if( $standalone );
        output( "?>\n" );
    }

    #
    # handle doctype declaration:
    # try to duplicate the original
    #
    sub doctype_decl {
        my( $self, $properties ) = @_;
        output( "\n<!DOCTYPE " . $properties->{'Name'} . "\n" );
        my $pubid = $properties->{'PublicId'};
        my $sysid = $properties->{'SystemId'};
        if( $pubid ) {
            output( " PUBLIC \"$pubid\"\n" );
            output( " \"$sysid\"\n" );
        } elsif( $sysid ) {
            # skip the external identifier when the doctype has none
            output( " SYSTEM \"$sysid\"\n" );
        }
        my $intset = $properties->{'Internal'};
        if( $intset ) {
            $in_intset = 1;    # note that an internal subset is open
            output( "[\n" );
        } else {
            output( ">\n" );
        }
    }

    #
    # handle entity declaration in internal subset:
    # recreate the original declaration as it was
    #
    sub entity_decl {
        my( $self, $properties ) = @_;
        my $name = $properties->{'Name'};
        output( "<!ENTITY $name " );
        my $pubid = $properties->{'PublicId'};
        my $sysid = $properties->{'SystemId'};
        if( $pubid ) {
            output( "PUBLIC \"$pubid\" \"$sysid\"" );
        } elsif( $sysid ) {
            output( "SYSTEM \"$sysid\"" );
        } else {
            output( "\"" . $properties->{'Value'} . "\"" );
        }
        output( ">\n" );
    }
Now let's see how the output from our filter looks. The result is in Example 5-5.
    <?xml version="1.0"?>

    <!DOCTYPE book
     SYSTEM "/usr/local/prod/sgml/db.dtd"
    [
    <!ENTITY thingy "hoo hah blah blah">
    ]>
    <book id="mybook">
    <title>GRXL in a Nutshell</title>
    <chapter id="intro">
    <title>What is GRXL?</title>
    <comment> need a better title </comment>
    <para>
    Yet another acronym. That was our attitude at first, but then we saw
    the amazing uses of this new technology called
    <literal>GRXL</literal>. Consider the following program:
    </para>
    <programlisting>AH aof -- %%%% {{{{{{ let x = 0 }}}}}} print! <lineannotation>wow</lineannotation> or not!</programlisting>
    <comment> what font should we use? </comment>
    <para>
    What does it do? Who cares? It's just lovely to look at. In fact,
    I'd have to say, "&thingy;".
    </para>
    </chapter>
    </book>
That's much better. Now we have a complete filter program. The basic handlers take care of elements and everything inside them. The DTD handlers deal with whatever happens outside of the root element.
Copyright © 2002 O'Reilly & Associates. All rights reserved.