Introducing CGI and mod_perl (Practical mod_perl)

This chapter provides the foundations on which the rest of the book builds. In this chapter, we give you:

1.1. A Brief History of CGI

When the World Wide Web was born, there was only one web server and one web client. The httpd web server was developed by the Centre d'Etudes et de Recherche Nucléaires (CERN) in Geneva, Switzerland. httpd has since become the generic name of the binary executable of many web servers. When CERN stopped funding the development of httpd, it was taken over by the Software Development Group of the National Center for Supercomputing Applications (NCSA). The NCSA also produced Mosaic, the first web browser, whose developers later went on to write the Netscape client.

Mosaic could fetch and view static documents [2] and images served by the httpd server. This provided a far better means of disseminating information to large numbers of people than sending each person an email. However, the glut of online resources soon made search engines necessary, which meant that users needed to be able to submit data (such as a search string) and servers needed to process that data and return appropriate content.

[2]A static document is one that exists in a constant state, such as a text file that doesn't change.

Search engines were first implemented by extending the web server, modifying its source code directly. Rewriting the source was not very practical, however, so the NCSA developed the Common Gateway Interface (CGI) specification. CGI became a standard for interfacing external applications with web servers and other information servers and generating dynamic information.

A CGI program can be written in virtually any language that can read from STDIN and write to STDOUT, regardless of whether it is interpreted (e.g., the Unix shell), compiled (e.g., C or C++), or a combination of both (e.g., Perl). The first CGI programs were written in C and needed to be compiled into binary executables. For this reason, the directory from which the compiled CGI programs were executed was named cgi-bin, and the source files directory was named cgi-src. Nowadays most servers come with a preconfigured directory for CGI programs called, as you have probably guessed, cgi-bin.

1.1.1. The HTTP Protocol

Interaction between the browser and the server is governed by the HyperText Transfer Protocol (HTTP), now an official Internet standard maintained by the World Wide Web Consortium (W3C). HTTP uses a simple request/response model: the client establishes a TCP [3] connection to the server and sends a request, the server sends a response, and the connection is closed. Requests and responses take the form of messages. A message is a simple sequence of text lines.

[3]TCP/IP is a low-level Internet protocol for transmitting bits of data, regardless of its use.

HTTP messages have two parts. First come the headers, which hold descriptive information about the request or response. The various types of headers and their possible content are fully specified by the HTTP protocol. Headers are followed by a blank line, then by the message body. The body is the actual content of the message, such as an HTML page or a GIF image. The HTTP protocol does not define the content of the body; rather, specific headers are used to describe the content type and its encoding. This enables new content types to be incorporated into the Web without any fanfare.

HTTP is a stateless protocol. This means that requests are not related to each other. This makes life simple for CGI programs: they need worry about only the current request.

1.1.2. The Common Gateway Interface Specification

If you are new to the CGI world, there's no need to worry—basic CGI programming is very easy. Ninety percent of CGI-specific code is concerned with reading data submitted by a user through an HTML form, processing it, and returning some response, usually as an HTML document.

In this section, we will show you how easy basic CGI programming is, rather than trying to teach you the entire CGI specification. There are many books and online tutorials that cover CGI in great detail (see http://hoohoo.ncsa.uiuc.edu/). Our aim is to demonstrate that if you know Perl, you can start writing CGI scripts almost immediately. You need to learn only two things: how to accept data and how to generate output.

The HTTP protocol makes clients and servers understand each other by transferring all the information between them using headers, where each header is a key-value pair. When you submit a form, the CGI program looks for the headers that contain the input information, processes the received data (e.g., queries a database for the keywords supplied through the form), and—when it is ready to return a response to the client—sends a special header that tells the client what kind of information it should expect, followed by the information itself. The server can send additional headers, but these are optional. Figure 1-1 depicts a typical request-response cycle.

Figure 1-1. Request-response cycle

Sometimes CGI programs can generate a response without needing any input data from the client. For example, a news service may respond with the latest stories without asking for any input from the client. But if you want stories for a specific day, you have to tell the script which day's stories you want. Hence, the script will need to retrieve some input from you.

To get your feet wet with CGI scripts, let's look at the classic "Hello world" script for CGI, shown in Example 1-1.

Example 1-1. "Hello world" script

  #!/usr/bin/perl -Tw
  
  print "Content-type: text/plain\n\n";
  print "Hello world!\n";

We start by sending a Content-type header, which tells the client that the data that follows is of plain-text type. text/plain is a Multipurpose Internet Mail Extensions (MIME) type. You can find a list of widely used MIME types in the mime.types file, which is usually located in the directory where your web server's configuration files are stored.[4] Other examples of MIME types are text/html (text in HTML format) and video/mpeg (an MPEG stream).

[4]For more information about Internet media types, refer to RFCs 2045, 2046, 2047, 2048, and 2077, accessible from http://www.rfc-editor.org/.

According to the HTTP protocol, an empty line must be sent after all headers have been sent. This empty line indicates that the actual response data will start at the next line.[5]

[5]The protocol specifies the end of a line as the character sequence Ctrl-M and Ctrl-J (carriage return and newline). On Unix and Windows systems, this sequence is expressed in a Perl string as \015\012, but Apache also honors \n, which we will use throughout this book. On EBCDIC machines, an explicit \r\nshould be used instead.

Now save the code in hello.pl, put it into a cgi-bin directory on your server, make the script executable, and test the script by pointing your favorite browser to:

http://localhost/cgi-bin/hello.pl

It should display the same output as Figure 1-2.

Figure 1-2. Hello world

A more complicated script involves parsing input data. There are a few ways to pass data to the scripts, but the most commonly used are the GET and POST methods. Let's write a script that expects as input the user's name and prints this name in its response. We'll use the GET method, which passes data in the request URI (uniform resource indicator):

http://localhost/cgi-bin/hello.pl?username=Doug

When the server accepts this request, it knows to split the URI into two parts: a path to the script (http://localhost/cgi-bin/hello.pl) and the "data" part (username=Doug, called the QUERY_STRING). All we have to do is parse the data portion of the URI and extract the key username and value Doug. The GET method is used mostly for hardcoded queries, where no interactive input is needed. Assuming that portions of your site are dynamically generated, your site's menu might include the following HTML code:

<a href="/cgi-bin/display.pl?section=news">News</a><br>
<a href="/cgi-bin/display.pl?section=stories">Stories</a><br>
<a href="/cgi-bin/display.pl?section=links">Links</a><br>

Another approach is to use an HTML form, where the user fills in some parameters. The HTML form for the "Hello user" script that we will look at in this section can be either:

<form action="/cgi-bin/hello_user.pl" method="POST">
<input type="text" name="username">
<input type="submit">
</form>

or:

<form action="/cgi-bin/hello_user.pl" method="GET">
<input type="text" name="username">
<input type="submit">
</form>

Note that you can use either the GET or POST method in an HTML form. However, POSTshould be used when the query has side effects, such as changing a record in a database, while GETshould be used in simple queries like this one (simple URL links are GET requests).[6]

[6]See Axioms of Web Architecture at http://www.w3.org/DesignIssues/Axioms.html#state.

Formerly, reading input data required different code, depending on the method used to submit the data. We can now use Perl modules that do all the work for us. The most widely used CGI library is the CGI.pm module, written by Lincoln Stein, which is included in the Perl distribution. Along with parsing input data, it provides an easy API to generate the HTML response.

Our sample "Hello user" script is shown in Example 1-2.

Example 1-2. "Hello user" script

  #!/usr/bin/perl
  
  use CGI qw(:standard);
  my $username = param('username') || "unknown";
  
  print "Content-type: text/plain\n\n";
  print "Hello $username!\n";

Notice that this script is only slightly different from the previous one. We've pulled in the CGI.pm module, importing a group of functions called :standard. We then used its param( ) function to retrieve the value of the username key. This call will return the name submitted by any of the three ways described above (a form using either POST, GET, or a hardcoded name with GET; the last two are essentially the same). If no value was supplied in the request, param( ) returns undef.

my $username = param('username') || "unknown";

$username will contain either the submitted username or the string "unknown" if no value was submitted. The rest of the script is unchanged—we send the MIME header and print the "Hello $username!" string.[7]

[7]All scripts shown here generate plain text, not HTML. If you generate HTML output, you have to protect the incoming data from cross-site scripting. For more information, refer to the CERT advisory at http://www.cert.org/advisories/CA-2000-02.html.

As we've just mentioned, CGI.pm can help us with output generation as well. We can use it to generate MIME headers by rewriting the original script as shown in Example 1-3.

Example 1-3. "Hello user" script using CGI.pm

  #!/usr/bin/perl
  
  use CGI qw(:standard);
  my $username = param('username') || "unknown";
  
  print header("text/plain");
  print "Hello $username!\n";

To help you learn how CGI.pm copes with more than one parameter, consider the code in Example 1-4.

Example 1-4. CGI.pm and param( ) method

  #!/usr/bin/perl
  
  use CGI qw(:standard);
  print header("text/plain");
  
  print "The passed parameters were:\n";
  for my $key ( param( ) ) {
      print "$key => ", param($key), "\n";
  }

Now issue the following request:

http://localhost/cgi-bin/hello_user.pl?a=foo&b=bar&c=foobar

Separating key=value Pairs

Note that & or ; usually is used to separate the key=value pairs. The former is less preferable, because if you end up with a QUERY_STRING of this format:

id=foo&reg=bar

some browsers will interpret &reg as an SGML entity and encode it as ®. This will result in a corrupted QUERY_STRING:

id=foo®=bar

You have to encode & as & if it is included in HTML. You don't have this problem if you use ; as a separator:

id=foo;reg=bar

Both separators are supported by CGI.pm, Apache::Request, and mod_perl's args( ) method, which we will use in the examples to retrieve the request parameters.

Of course, the code that builds QUERY_STRING has to ensure that the values don't include the chosen separator and encode it if it is used. (See RFC2854 for more details.)

The browser will display:

The passed parameters were:
a => foo
b => bar
c => foobar

Now generate this form:

<form action="/cgi-bin/hello_user.pl" method="GET">
<input type="text" name="firstname">
<input type="text" name="lastname">
<input type="submit">
</form>

If we fill in only the firstname field with the value Doug, the browser will display:

The passed parameters were:
firstname => Doug
lastname =>

If in addition the lastname field is MacEachern, you will see:

The passed parameters were:
firstname => Doug
lastname => MacEachern

These are just a few of the many functions CGI.pm offers. Read its manpage for detailed information by typing perldoc CGI at your command prompt.

We used this long CGI.pm example to demonstrate how simple basic CGI is. You shouldn't reinvent the wheel; use standard tools when writing your own scripts, and you will save a lot of time. Just as with Perl, you can start creating really cool and powerful code from the very beginning, gaining more advanced knowledge over time. There is much more to know about the CGI specification, and you will learn about some of its advanced features in the course of your web development practice. We will cover the most commonly used features in this book.

For now, let CGI.pm or an equivalent library handle the intricacies of the CGI specification, and concentrate your efforts on the core functionality of your code.

1.1.3. Apache CGI Handling with mod_cgi

The Apache server processes CGI scripts via an Apache module called mod_cgi. (See later in this chapter for more information on request-processing phases and Apache modules.) mod_cgi is built by default with the Apache core, and the installation procedure also preconfigures a cgi-bin directory and populates it with a few sample CGI scripts. Write your script, move it into the cgi-bin directory, make it readable and executable by the web server, and you can start using it right away.

Should you wish to alter the default configuration, there are only a few configuration directives that you might want to modify. First, the ScriptAlias directive:

ScriptAlias /cgi-bin/ /home/httpd/cgi-bin/

ScriptAlias controls which directories contain server scripts. Scripts are run by the server when requested, rather than sent as documents.

When a request is received with a path that starts with /cgi-bin, the server searches for the file in the /home/httpd/cgi-bin directory. It then runs the file as an executable program, returning to the client the generated output, not the source listing of the file.

The other important part of httpd.conf specifies how the files in cgi-bin should be treated:

<Directory /home/httpd/cgi-bin>
    Options FollowSymLinks
    Order allow,deny
    Allow from all
</Directory>

The above setting allows the use of symbolic links in the /home/httpd/cgi-bin directory. It also allows anyone to access the scripts from anywhere.

mod_cgi provides access to various server parameters through environment variables. The script in Example 1-5 will print these environment variables.

Example 1-5. Checking environment variables

  #!/usr/bin/perl
  
  print "Content-type: text/plain\n\n";
  for (keys %ENV) {
      print "$_ => $ENV{$_}\n";
  }

Save this script as env.pl in the directory cgi-bin and make it executable and readable by the server (that is, by the username under which the server runs). Point your browser to http://localhost/cgi-bin/env.pl and you will see a list of parameters similar to this one:

SERVER_SOFTWARE => Server: Apache/1.3.24 (Unix) mod_perl/1.26 
                   mod_ssl/2.8.8 OpenSSL/0.9.6
GATEWAY_INTERFACE => CGI/1.1
DOCUMENT_ROOT => /home/httpd/docs
REMOTE_ADDR => 127.0.0.1
SERVER_PROTOCOL => HTTP/1.0
REQUEST_METHOD => GET
QUERY_STRING =>
HTTP_USER_AGENT => Mozilla/5.0 Galeon/1.2.1 (X11; Linux i686; U;) Gecko/0
SERVER_ADDR => 127.0.0.1
SCRIPT_NAME => /cgi-bin/env.pl
SCRIPT_FILENAME => /home/httpd/cgi-bin/env.pl

Your code can access any of these variables with $ENV{"somekey"}. However, some variables can be spoofed by the client side, so you should be careful if you rely on them for handling sensitive information. Let's look at some of these environment variables.

SERVER_SOFTWARE => Server: Apache/1.3.24 (Unix) mod_perl/1.26 
                   mod_ssl/2.8.8 OpenSSL/0.9.6

The SERVER_SOFTWARE variable tells us what components are compiled into the server, and their version numbers. In this example, we used Apache 1.3.24, mod_perl 1.26, mod_ssl 2.8.8, and OpenSSL 0.9.6.

GATEWAY_INTERFACE => CGI/1.1

The GATEWAY_INTERFACE variable is very important; in this example, it tells us that the script is running under mod_cgi. When running under mod_perl, this value changes to CGI-Perl/1.1.

REMOTE_ADDR => 127.0.0.1

The REMOTE_ADDR variable tells us the remote address of the client. In this example, both client and server were running on the same machine, so the client is localhost (whose IP is 127.0.0.1).

SERVER_PROTOCOL => HTTP/1.0

The SERVER_PROTOCOL variable reports the HTTP protocol version upon which the client and the server have agreed. Part of the communication between the client and the server is a negotiation of which version of the HTTP protocol to use. The highest version the two can understand will be chosen as a result of this negotiation.

REQUEST_METHOD => GET

The now-familiar REQUEST_METHOD variable tells us which request method was used (GET, in this case).

QUERY_STRING =>

The QUERY_STRING variable is also very important. It is used to pass the query parameters when using the GET method. QUERY_STRING is empty in this example, because we didn't pass any parameters.

HTTP_USER_AGENT => Mozilla/5.0 Galeon/1.2.1 (X11; Linux i686; U;) Gecko/0

The HTTP_USER_AGENT variable contains the user agent specifications. In this example, we are using Galeon on Linux. Note that this variable is very easily spoofed.

Spoofing HTTP_USER_AGENT

If the client is a custom program rather than a widely used browser, it can mimic its bigger brother's signature. Here is an example of a very simple client using the LWP library:

#!/usr/bin/perl -w use LWP::UserAgent; my $ua = new LWP::UserAgent; $ua->agent("Mozilla/5.0 Galeon/1.2.1 (X11; Linux i686; U;) Gecko/0"); my $req = new HTTP::Request('GET', 'http://localhost/cgi-bin/env.pl'); my $res = $ua->request($req); print $res->content if $res->is_success;

This script first creates an instance of a user agent, with a signature identical to Galeon's on Linux. It then creates a request object, which is passed to the user agent for processing. The response content is received and printed.

When run from the command line, the output of this script is strikingly similar to what we obtained with the browser. It notably prints:

HTTP_USER_AGENT => Mozilla/5.0 Galeon/1.2.1 (X11; Linux i686; U;) Gecko/0

So you can see how easy it is to fool a naïve CGI programmer into thinking we've used Galeon as our client program.

SERVER_ADDR => 127.0.0.1
SCRIPT_NAME => /cgi-bin/env.pl
SCRIPT_FILENAME => /home/httpd/cgi-bin/env.pl

The SERVER_ADDR, SCRIPT_NAME, and SCRIPT_FILENAME variables tell us (respectively) the server address, the name of the script as provided in the request URI, and the real path to the script on the filesystem.

Now let's get back to the QUERY_STRING parameter. If we submit a new request for http://localhost/cgi-bin/env.pl?foo=ok&bar=not_ok, the new value of the query string is displayed: