Hanging Processes: Detection and Diagnostics (Practical mod_perl)

Sometimes an httpd process might hang in the middle of processing a request. This may happen because of a bug in the code, such as being stuck in a while loop. Or it may be blocked in a system call, perhaps waiting indefinitely for an unavailable resource. To fix the problem, we need to learn in what circumstances the process hangs, so that we can reproduce the problem, which will allow us to uncover its cause.

21.7.1. Hanging Because of an Operating System Problem

Sometimes you can find a process hanging because of some kind of system problem. For example, if the processes was doing some disk I/O operation, it might get stuck in uninterruptible sleep ('D' disk wait in ps report, 'U' in top), which indicates either that something is broken in your kernel or that you're using NFS. Also, usually you find that you cannot kill -9 this process.

Another process that cannot be killed with kill -9 is a zombie process ('Z' disk wait in ps report, <defunc> in top), in which case the process is already dead and Apache didn't wait on it properly (of course, it can be some other process not related to Apache).

In the case of disk wait, you can actually get the wait channel from ps and look it up in your kernel symbol table to find out what resource it was waiting on. This might point the way to what component of the system is misbehaving, if the problem occurs frequently.

21.7.2. When a Process Might Hang

In Chapter 19, we discussed the concept of deadlock. This can happen when two processes are each trying to acquire locks on the resources held by the other. Neither process will release the lock it currently holds, and thus neither can acquire a lock on the second resource it desires.

This scenario is a very good candidate for code that might lead to a hanging process. Since usually the deadlock cannot be resolved without manual intervention, the two processes will hang, doing nothing and wasting system resources, while trying to acquire locks.

An infinite loop might lead to a hanging process as well. Moreover, such a loop will usually consume a lot of CPU resources and memory. You should be very careful when using while and similar loop constructs that are capable of creating endless loops.

A process relying on some external resource, for example when accessing a file over NFS, might hang if the mounted partition it tries to access is not available. Usually it takes a long time before the request times out, and in the meantime the process may hang.

There are many other reasons that a process might hang, but these are some of the most common.

21.7.3. Detecting Hanging Processes

It's not so easy to detect hanging processes. There is no way you can tell how long the request is taking to process by using plain system utilities such as ps and top. The reason is that each Apache process serves many requests without quitting. System utilities can tell how long the process has been running since its creation, but this information is useless in our case, since Apache processes normally run for extended periods.

However, there are a few approaches that can help to detect a hanging process. If the hanging process happens to demand lots of resources, it's quite easy to spot it by using the top utility. You will see the same process show up in the first few lines of the automatically refreshed report. (But often the hanging process uses few resources—e.g., when waiting for some event to happen.)

Another easy case is when some process thrashes the error_log file, writing millions of error messages there. Generally this process uses lots of resources and is also easily spotted by using top.

Two other tools that report the status of Apache processes are the mod_status module, which is usually accessed from the /server_status location, and the Apache::VMonitor module, covered in Chapter 5.

Both tools provide counters of requests processed per Apache process. You can watch the report for a few minutes and try to spot any process with an unchanging number of processed requests and a status of W (waiting). This means that it has hung.

But if you have 50 processes, it can be quite hard to spot such a process. Apache::Watchdog::RunAway is a hanging-processes monitor and terminator that implements this feature and could be used to solve this kind of problem. It's covered in Chapter 5.

If you have a really bad problem, where processes hang one after the other, the time will come when the number of hanging processes is equal to the value of MaxClients. This means that no more processes will be spawned. As far as the users are concerned, your server will be down. It is easy to detect this situation, attempt to resolve it, and notify the administrator using a simple crontab watchdog that periodically requests some very light script (see Chapter 5).

In the watchdog, you set a timeout appropriate for your service, which may be anything from a few seconds to a few minutes. If the server fails to respond before the timeout expires, the watchdog spots trouble and attempts to restart the server. After a restart an email report is sent to the administrator saying that there was a problem and whether or not the restart was successful.

If you get such reports constantly, something is wrong with your web service and you should review your code. Note that it's possible that your server is being overloaded by more requests than it can handle, so the requests are being queued and not processed for a while, which triggers the watchdog's alarm. If this is the case, you may need to add more servers or more memory, or perhaps split a single machine across a cluster of machines.

21.7.4. Determination of the Reason

Given the PID, there are three ways to find out where the server is hanging:

Deploy the Perl calls-tracing mechanism. This will allow you to spot the location of the Perl code that triggers the problem.
Use a system calls-tracing utility such as strace. This approach reveals low-level details about the misbehavior of some part of the system.
Use an interactive debugger such as gdb. When the process is stuck and you don't know what it was doing just before it got stuck, using gdb you can attach to this process and print its call stack, to reveal where the last call originated. Just like with strace, you see the C function call trace, not the Perl high-level function calls.

21.7.4.1. Using the Perl trace

To see where an httpd process is spinning, the first step is to add the following to your startup file:

package Book::StartUp;
use Carp ( );
$SIG{'USR2'} = sub { 
    Carp::confess("caught SIGUSR2!");
};

The above code assigns a signal handler for the USR2signal. This signal has been chosen because it's unlikely to be used by the other server components.

We can check the registered signal handlers with help of Apache::Status. Using this code, if we fetch the URL http://localhost/perl-status?sig we will see:

USR2 = \&Book::StartUp::_ _ANON_ _

where Book::StartUp is the name of the package declared in startup.pl.

After applying this server configuration, let's use the simple code in Example 21-10, where sleep(10000) will emulate a hanging process.

Example 21-10. debug/perl_trace.pl

local $|=1;
my $r = shift;
$r->send_http_header('text/plain');

print "[$$] Going to sleep\n";
hanging_sub( );

sub hanging_sub { sleep 10000; }

We execute the above script as http://localhost/perl/debug/perl_trace.pl. In the script we use $|=1; to unbuffer the STDOUT stream and we get the PID from the $$special variable.

Now we issue the kill command, using the PID we have just seen printed to the browser's window:

panic% kill -USR2 PID

and watch this showing up in the error_log file:

caught SIGUSR2!
    at /home/httpd/perl/startup/startup.pl line 32
Book::StartUp::_ _ANON_ _('USR2') called 
    at /home/httpd/perl/debug/perl_trace.pl line 6
Apache::ROOT::perl::debug::perl_trace_2epl::hanging_sub( ) called 
    at /home/httpd/perl/debug/perl_trace.pl line 5
Apache::ROOT::perl::debug::perl_trace_2epl::handler('Apache=SCALAR(0x8309d08)') 
  called 
    at /usr/lib/perl5/site_perl/5.6.1/i386-linux/Apache/Registry.pm 
      line 140
eval {...} called 
    at /usr/lib/perl5/site_perl/5.6.1/i386-linux/Apache/Registry.pm
      line 140
Apache::Registry::handler('Apache=SCALAR(0x8309d08)') called 
    at PerlHandler subroutine `Apache::Registry::handler' line 0
eval {...} called 
    at PerlHandler subroutine `Apache::Registry::handler' line 0

We can clearly see that the process "hangs" in the code executed at line 6 of the /home/httpd/perl/debug/perl_trace.pl script, and it was called by the hanging_sub( ) routine defined at line 5.

21.7.4.2. Using the system calls trace

Let's write another similar mod_perl script that hangs, and deploy strace to find the point at which it hangs (see Example 21-11).

Example 21-11. hangme.pl

local $|=1;
my $r = shift;
$r->send_http_header('text/plain');

print "PID = $$\n";

my $i = 0;
while (1) {
    $i++;
    sleep 1;
}

The reason this simple code hangs is obvious. It never breaks from the while loop. As you can see, it prints the PID of the current process to the browser. Of course, in a real situation you cannot use the same trick—in the previous section we presented several ways to detect the runaway processes and their PIDs.

We save the above code in a file and make a request. As usual, we use $|=1; in our demonstration scripts to unbuffer STDOUT so we will immediately see the process ID. Once the script is requested, the script prints the PID and obviously hangs. So we press the Stop button, but the process continues to hang in this code. Isn't Apache supposed to detect the broken connection and abort the request? Yes and no—you will understand soon what's really happening.

First let's attach to the process and see what it's doing. We use the PID the script printed to the browser—in this case, it is 10045:

panic% strace -p 10045

[...truncated identical output...]
SYS_175(0, 0xbffff41c, 0xbffff39c, 0x8, 0) = 0
SYS_174(0x11, 0, 0xbffff1a0, 0x8, 0x11) = 0
SYS_175(0x2, 0xbffff39c, 0, 0x8, 0x2)   = 0
nanosleep(0xbffff308, 0xbffff308, 0x401a61b4, 0xbffff308, 0xbffff41c) = 0
time([940973834])                       = 940973834
time([940973834])                       = 940973834
[...truncated the identical output...]

It isn't what we expected to see, is it? These are some system calls we don't see in our little example. What we actually see is how Perl translates our code into system calls. We know that our code hangs in this snippet:

while (1) {
    $i++;
    sleep 1;
}

so these must be the system calls that represent this loop, since they are printed repeatedly.

Usually the situation is different from the one we have shown. You first detect the hanging process, then you attach to it and watch the trace of calls it does (or observe the last few system calls if the process is hanging waiting for something, as when blocking on a file-lock request). From watching the trace you figure out what it's actually doing, and probably find the corresponding lines in your Perl code. For example, let's see how one process hangs while requesting an exclusive lock on a file that is exclusively locked by another process (see Example 21-12).

Example 21-12. excl_lock.pl

use Fcntl qw(:flock);
use Symbol;

fork( ); # child and parent do the same operation

my $fh = gensym;
open $fh, ">/tmp/lock" or die "cannot open /tmp/lock: $!";
print "$$: I'm going to obtain the lock\n";
flock $fh, LOCK_EX;
print "$$: I've got the lock\n";
sleep 30;
close $fh;

The code is simple. The process executing the code forks a second process, and both do the same thing: generate a unique symbol to be used as a file handle, open the lock file for writing using the generated symbol, lock the file in exclusive mode, sleep for 30 seconds (pretending to do some lengthy operation), and close the lock file, which also unlocks the file.

The gensym function is imported from the Symbol module. The Fcntl module provides us with a symbolic constant, LOCK_EX. This is imported via the :flock tag, which imports this and other flock( ) constants.

The code used by both processes is identical, so we cannot predict which one will get its hands on the lock file and succeed in locking it first. Thus, we add print( )statements to find the PID of the process blocking (waiting to get the lock) on a lock request.

When the above code is executed from the command line, we see that one of the processes gets the lock:

panic% perl ./excl_lock.pl

3038: I'm going to obtain the lock
3038: I've got the lock
3037: I'm going to obtain the lock

Here we see that process 3037 is blocking, so we attach to it:

panic% strace -p 3037

about to attach c10
flock(3, LOCK_EX

It's clear from the above trace that the process is waiting for an exclusive lock. The missing closing parenthesis is not a typo; it means that strace didn't yet receive a return status from the call.

After spending time watching the running traces of different scripts, you will learn to more easily recognize what Perl code is being executed.

21.7.4.3. Using the interactive debugger

Another way to see a trace of the running code is to use a debugger such as gdb (the GNU debugger). It's supposed to work on any platform that supports the GNU development tools. Its purpose is to allow you to see what is going on inside a program while it executes, or what it was doing at the moment it failed.

To trace the execution of a process, gdb needs to know the PID and the path to the binary that the process is executing. For Perl code, it's /usr/bin/perl (or whatever the path to your Perl is). For httpd processes, it's the path to your httpd executable—often the binary is called httpd, but there's really no standard location for it.

Here are a few examples using gdb. First, let's go back to our last locking example, execute it as before, and attach to the process that didn't get the lock:

panic% gdb /usr/bin/perl 3037

After starting the debugger, we execute the where command to see the trace:

(gdb) where
#0  0x40209791 in flock ( ) from /lib/libc.so.6
#1  0x400e8dc9 in Perl_pp_flock ( ) at pp_sys.c:2033
#2  0x40099c56 in Perl_runops_debug ( ) at run.c:53
#3  0x4003118c in S_run_body (oldscope=1) at perl.c:1443
#4  0x40030c7e in perl_run (my_perl=0x804bf00) at perl.c:1365
#5  0x804953e in main (argc=3, argv=0xbffffac4, env=0xbffffad4)
    at perlmain.c:52
#6  0x4018bcbe in _ _libc_start_main ( ) from /lib/libc.so.6

That's not what we may have expected to see (i.e., a Perl stack trace). And now it's a different trace from the one we saw when we were using strace. Here we see the current state of the call stack, with main( ) at the bottom of the stack and flock( ) at the top.

We have to find out the place the code was called from—it's possible that the code calls flock( ) in several places, and we won't be able to locate the place in the code where the actual problem occurs without having this information. Therefore, we again use the curinfo macro after loading it from the .gdbinit file:

(gdb) source /usr/src/httpd_perl/mod_perl-1.25/.gdbinit
(gdb) curinfo
9:/home/httpd/perl/excl_lock.pl

As we can see, the program was stuck at line 9 of /home/httpd/perl/excl_lock.pl and that's the place to look at to resolve the problem.

When you attach to a running process with gdb, the program stops executing and control of the program is passed to the debugger. You can continue the normal program run with the continue command or execute it step by step with the next and step commands, which you type at the gdb prompt. (next steps over any function calls in the source, while step steps into them.)

The use of C/C++ debuggers is a large topic, beyond the scope of this book. The gdb man and info pages are quite good. You might also want to check ddd (the Data Display Debugger), which provides a visual interface to gdb and other debuggers. It even knows how to debug Perl programs.

For completeness, let's see the gdb trace of the httpd process that's hanging in the while(1) loop of the first example in this section:

panic% gdb /home/httpd/httpd_perl/bin/httpd 1005

(gdb) where
#0  0x402251c1 in nanosleep ( ) from /lib/libc.so.6
#1  0x40225158 in sleep ( ) from /lib/libc.so.6
#2  0x4014d3a6 in Perl_pp_sleep ( ) at pp_sys.c:4187
#3  0x400f5c56 in Perl_runops_debug ( ) at run.c:53
#4  0x4008e088 in S_call_body (myop=0xbffff688, is_eval=0) at perl.c:1796
#5  0x4008dc4f in perl_call_sv (sv=0x82fc75c, flags=4) at perl.c:1714
#6  0x807350e in perl_call_handler (sv=0x82fc75c, r=0x8309eec, args=0x0)
    at mod_perl.c:1677
#7  0x80729cd in perl_run_stacked_handlers (hook=0x80d0db9 "PerlHandler", 
    r=0x8309eec, handlers=0x82e9b64) at mod_perl.c:1396
#8  0x80701b4 in perl_handler (r=0x8309eec) at mod_perl.c:922
#9  0x809f409 in ap_invoke_handler (r=0x8309eec) at http_config.c:517
#10 0x80b3e8f in process_request_internal (r=0x8309eec) at http_request.c:1286
#11 0x80b3efa in ap_process_request (r=0x8309eec) at http_request.c:1302
#12 0x80aae60 in child_main (child_num_arg=0) at http_main.c:4205
#13 0x80ab0e8 in make_child (s=0x80eea54, slot=0, now=981621024)
    at http_main.c:4364
#14 0x80ab19c in startup_children (number_to_start=3) at http_main.c:4391
#15 0x80ab80c in standalone_main (argc=1, argv=0xbffff9e4) at http_main.c:4679
#16 0x80ac03c in main (argc=1, argv=0xbffff9e4) at http_main.c:5006
#17 0x401bbcbe in _ _libc_start_main ( ) from /lib/libc.so.6

As before, we can see a complete trace of the last executed call. To see the line the program hangs, we use curinfo again:

(gdb) source /usr/src/httpd_perl/mod_perl-1.25/.gdbinit
(gdb) curinfo
9:/home/httpd/perl/hangme.pl

Indeed, the program spends most of its time at line 9:

7 : while (1) {
8 :     $i++;
9 :    sleep 1;
10: }

Since while( ) and $i++ are executed very fast, it's almost impossible to catch Perl running either of these instructions.