Apache's mod_proxy module implements a proxy and cache for Apache. It implements proxying capabilities for the following protocols: FTP, CONNECT (for SSL), HTTP/0.9, HTTP/1.0, and HTTP/1.1. The module can be configured to connect to other proxy modules for these and other protocols.
mod_proxy is part of Apache, so there is no need to install a separate server—you just have to enable this module during the Apache build process or, if you have Apache compiled as a DSO, you can compile and add this module after you have completed the build of Apache.
A setup with a mod_proxy-enabled server and a mod_perl-enabled server is depicted in Figure 12-6.
We do not think the difference in speed between Apache's mod_proxy and Squid is relevant for most sites, since the real value of what they do is buffering for slow client connections. However, Squid runs as a single process and probably consumes fewer system resources.
The trade-off is that mod_rewrite is easy to use if you want to spread parts of the site across different backend servers, while mod_proxy knows how to fix up redirects containing the backend server's idea of the location. With Squid you can run a redirector process to proxy to more than one backend, but there is a problem in fixing redirects in a way that keeps the client's view of both server names and port numbers in all cases.
The difficult case is where you have DNS aliases that map to the same IP address, you want them redirected to port 80 (although the server is on a different port), and you want to keep the specific name the browser has already sent so that it does not change in the client's browser's location window.
The advantages of mod_proxy are:
No additional server is needed. We keep the plain one plus one mod_perl-enabled Apache server. All you need to do is enable mod_proxy in the httpd_docs server and add a few lines like these to its httpd.conf file:
ProxyPass /perl/ http://localhost:81/perl/
ProxyPassReverse /perl/ http://localhost:81/perl/
The ProxyPass directive triggers the proxying process. A request for http://example.com/perl/ is proxied by issuing a request for http://localhost:81/perl/ to the mod_perl server. mod_proxy then sends the response to the client. The URL rewriting is transparent to the client, except in one case: if the mod_perl server issues a redirect, the URL to redirect to will be specified in a Location header in the response. This is where ProxyPassReverse kicks in: it scans Location headers from the responses it gets from proxied requests and rewrites the URL before forwarding the response to the client. (A combined sketch of both servers' configurations appears right after this list of advantages.)
It buffers mod_perl output like Squid does.
It does caching, although you have to produce correct Content-Length, Last-Modified, and Expires HTTP headers for it to work. If some of your dynamic content does not change frequently, you can dramatically increase performance by caching it with mod_proxy.
ProxyPass happens before the authentication phase, so you do not have to worry about authenticating twice.
Apache is able to accelerate secure HTTP requests completely, while also doing accelerated HTTP. With Squid you have to use an external redirection program for that.
The latest mod_proxy module (for Apache 1.3.6 and later) is reported to be very stable.
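Putting the ProxyPass and ProxyPassReverse lines shown above into context, here is a minimal sketch of how the two servers' httpd.conf files fit together (port 81 comes from the example; port 80 for the frontend and the Port/Listen directives are assumptions):
# httpd.conf of the frontend (httpd_docs) server
Port 80
ProxyPass /perl/ http://localhost:81/perl/
ProxyPassReverse /perl/ http://localhost:81/perl/

# httpd.conf of the backend (mod_perl) server
Port 81
Listen 81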
In the following explanation, we will use www.example.com as the main server users access when they want to get some kind of service and backend.example.com as the machine that does the heavy work. The main and backend servers are different; they may or may not coexist on the same machine.
We'll use the mod_proxy module built into the main server to handle requests to www.example.com. For the sake of this discussion it doesn't matter what functionality is built into the backend.example.com server—obviously it'll be mod_perl for most of us, but this technique can be successfully applied to other web programming languages (PHP, Java, etc.).
You can use the ProxyPass configuration directive to map remote hosts into the URL space of the local server; the local server does not act as a proxy in the conventional sense, but appears to be a mirror of the remote server.
Let's explore what the following rule does:
ProxyPass /perl/ http://backend.example.com/perl/
When a user initiates a request to http://www.example.com/perl/foo.pl, the request is picked up by mod_proxy. It issues a request for http://backend.example.com/perl/foo.pl and forwards the response to the client. This reverse proxy process is mostly transparent to the client, as long as the response data does not contain absolute URLs.
One such situation occurs when the backend server issues a redirect. The URL to redirect to is provided in a Location header in the response. The backend server will use its own ServerName and Port to build the URL to redirect to. For example, mod_dir will redirect a request for http://www.example.com/somedir/ to http://backend.example.com/somedir/ by issuing a redirect with the following header:
Location: http://backend.example.com/somedir/
Since ProxyPass forwards the response unchanged to the client, the user will see http://backend.example.com/somedir/ in her browser's location window, instead of http://www.example.com/somedir/.
You have probably noticed many examples of this from real-life web sites you've visited. Free email service providers and other similar heavy online services display the login or the main page from their main server, and then when you log in you see something like x11.example.com, then w59.example.com, etc. These are the backend servers that do the actual work.
Obviously this is not an ideal solution, but since users don't usually care about what they see in the location window, you can sometimes get away with this approach. In the following section we show a better solution that solves this issue and provides even more useful functionalities.
The ProxyPassReverse directive lets Apache adjust the URL in the Location header on HTTP redirect responses. This is essential when Apache is used as a reverse proxy, to avoid bypassing the reverse proxy because of HTTP redirects on the backend servers. It is generally used in conjunction with the ProxyPass directive to build a complete frontend proxy server.
ProxyPass /perl/ http://backend.example.com/perl/
ProxyPassReverse /perl/ http://backend.example.com/perl/
When a user initiates a request to http://www.example.com/perl/foo, the request is proxied to http://backend.example.com/perl/foo. Let's say the backend server responds by issuing a redirect for http://backend.example.com/perl/foo/ (adding a trailing slash). The response will include a Location header:
Location: http://backend.example.com/perl/foo/
ProxyPassReverse on the frontend server will rewrite this header to:
Location: http://www.example.com/perl/foo/
This happens completely transparently. The end user is never aware of the URL rewrites happening behind the scenes.
Note that this ProxyPassReverse directive can also be used in conjunction with the proxy pass-through feature of mod_rewrite, described later in this chapter.
Whenever you use mod_proxy, you need to make sure that your server will not become an open proxy for free riders. Whether clients may issue forward proxy requests is controlled by the ProxyRequests directive. Its default setting is Off, which means proxy requests are handled only if generated internally (by ProxyPass or RewriteRule...[P] directives). Do not turn ProxyRequests on for your reverse proxy servers.
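For example, a locked-down frontend configuration that relies only on explicit mappings might look like this minimal sketch (the explicit ProxyRequests Off line merely restates the default; the path and host name follow the examples in this section):
# never act as a forward proxy for arbitrary clients; only the
# explicit reverse-proxy mappings below are honored
ProxyRequests Off
ProxyPass /perl/ http://backend.example.com/perl/
ProxyPassReverse /perl/ http://backend.example.com/perl/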
Let's say that you have a frontend server running mod_ssl, mod_rewrite, and mod_proxy. You want to make sure that your user is using a secure connection for some specific actions, such as login information submission. You don't want to let the user log in unless the request was submitted through a secure port.
Since you have to proxy the request from the frontend server to the backend server, the backend cannot tell whether the original connection came in on the secure port; the HTTP headers cannot reliably provide this information.
A possible solution for this problem is to have the mod_perl server listen on two different ports (e.g., 8000 and 8001) and have the mod_rewrite proxy rule in the regular server redirect to port 8000 and the mod_rewrite proxy rule in the SSL virtual host redirect to port 8001. Under the mod_perl server, use $r->connection->port or the SERVER_PORT environment variable to tell whether the connection is secure.
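Here is a minimal sketch of such a setup. The port numbers (8000 and 8001) come from the text; the virtual-host layout, the RewriteRule patterns, and the /perl/ URI space are illustrative assumptions:
# frontend httpd.conf -- plain HTTP virtual host proxies to port 8000
<VirtualHost *:80>
    RewriteEngine On
    RewriteRule ^/perl/(.*)$ http://backend.example.com:8000/perl/$1 [P,L]
    ProxyPassReverse /perl/ http://backend.example.com:8000/perl/
</VirtualHost>

# frontend httpd.conf -- SSL virtual host proxies to port 8001
# (certificate and other mod_ssl directives omitted)
<VirtualHost *:443>
    SSLEngine on
    RewriteEngine On
    RewriteRule ^/perl/(.*)$ http://backend.example.com:8001/perl/$1 [P,L]
    ProxyPassReverse /perl/ http://backend.example.com:8001/perl/
</VirtualHost>

# backend (mod_perl) httpd.conf -- listen on both ports
Port 8000
Listen 8000
Listen 8001
A handler on the backend can then check which port the request arrived on (8001 meaning the original connection was secure) before accepting login data.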
In addition to correcting the URI on its way back from the backend server, mod_proxy, like Squid, also provides buffering services that benefit mod_perl and similar heavy modules. The buffering feature allows mod_perl to pass the generated data to mod_proxy and move on to serve new requests, instead of waiting for a possibly slow client to receive all the data.
Figure 12-7 depicts this feature.
mod_perl streams the generated response into the kernel send buffer, which in turn goes into the kernel receive buffer of mod_proxy via the TCP/IP connection. mod_proxy reads the data into its user space buffer and streams it into its own kernel send buffer, from which the data goes to the client over the TCP/IP connection. There are four buffers between mod_perl and the client: two kernel send buffers, one receive buffer, and finally the mod_proxy user space buffer. Each of those buffers will take data from the previous stage, as long as the buffer is not full. Now it's clear that in order to release the mod_perl process immediately, the generated response should fit into these four buffers.
If the data doesn't fit immediately into all buffers, mod_perl will wait until the first kernel buffer is emptied partially or completely (depending on the OS implementation) and then place more data into it. mod_perl will repeat this process until the last byte has been placed into the buffer.
The kernel's receive buffers (recvbuf) and send buffers (sendbuf) are used for different things: the receive buffers are for TCP data that hasn't been read by the application yet, and the send buffers are for application data that hasn't been sent over the network yet. The kernel buffers actually seem smaller than their declared size, because not everything goes to actual TCP/IP data. For example, if the size of the buffer is 64 KB, only about 55 KB or so can actually be used for data. Of course, the overhead varies from OS to OS.
It might not be a very good idea to increase the kernel's receive buffer too much, because you could just as easily increase mod_proxy's user space buffer size and get the same effect in terms of buffering capacity. Kernel memory is pinned (not swappable), so it's harder on the system to use a lot of it.
The user space buffer size for mod_proxy seems to be fixed at 8 KB, but changing it is just a matter of replacing HUGE_STRING_LEN with something else in src/modules/proxy/proxy_http.c under the Apache source distribution.
mod_proxy's receive buffer is configurable by the ProxyReceiveBufferSize parameter. For example:
ProxyReceiveBufferSize 16384
will create a buffer 16 KB in size. ProxyReceiveBufferSize must be bigger than or equal to 512 bytes. If it's not set or is set to 0, the system default will be used. The number it's set to should be an integral multiple of 512. ProxyReceiveBufferSize cannot be bigger than the kernel receive buffer size; if you set the value of ProxyReceiveBufferSize larger than this size, the default value will be used (a warning will be printed in this case by mod_proxy).
You can modify the source code to adjust the size of the server's internal read-write buffers by changing the definition of IOBUFSIZE in include/httpd.h.
Unfortunately, you cannot set the kernel buffers' sizes as large as you might want, because there is a limit to the available physical memory and each OS has its own upper limit on the possible buffer size. To increase the physical memory limits, you have to add more RAM. You can change the OS limits as well, but these procedures are OS-specific; here are examples for a couple of OSes. Under Linux, you can raise the maximum socket receive buffer size at runtime (here to 128 KB) by writing to the /proc filesystem:
panic# echo 131072 > /proc/sys/net/core/rmem_max
You probably want to put this command into /etc/rc.d/rc.local (or elsewhere, depending on the operating system and the distribution) or a similar script that is executed at server startup, so that the change is reapplied every time the system reboots.
For the 2.2.5 kernel, the maximum and default values are either 32 KB or 64 KB. You can also change the default and maximum values during kernel compilation; for that, you should alter the SK_RMEM_DEFAULT and SK_RMEM_MAX definitions, respectively. (Since kernel source files tend to change, use the grep(1) utility to find the files.)
The same applies for the write buffers. You need to adjust /proc/sys/net/core/wmem_max and possibly the default value in /proc/sys/net/core/wmem_default. If you want to adjust the kernel configuration, you have to adjust the SK_WMEM_DEFAULT and SK_WMEM_MAX definitions, respectively.
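For example, to raise the Linux write-buffer limits at runtime to 128 KB, mirroring the rmem_max example above:
panic# echo 131072 > /proc/sys/net/core/wmem_max
panic# echo 131072 > /proc/sys/net/core/wmem_default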
Under FreeBSD, the maximum socket buffer size can be raised with sysctl:
panic# sysctl -w kern.ipc.maxsockbuf=2621440
This buffering technique applies only to downstream data (data coming from the origin server to the proxy), not to upstream data. When a client issues a request, the first bits of data hit the mod_perl server immediately. After that, if the request includes a lot of data (e.g., a big POST request, usually a file upload) and the client has a slow connection, the mod_perl process will stay tied up waiting for all the data to come in (unless it decides to abort the request for some reason). Falling back on mod_cgi seems to be the best solution for specific scripts whose major function is receiving large amounts of upstream data. Another alternative is to use yet another mod_perl server dedicated to file uploads only, and have it serve those specific URIs through the correct proxy configuration.
Because of some technical complications in TCP/IP, at the end of each client connection, it is not enough for Apache to close the socket and forget about it; instead, it needs to spend about one second lingering (waiting) on the client.[43]
[43]More details can be found at http://httpd.apache.org/docs/misc/fin_wait_2.html.
lingerd is a daemon (service) designed to take over the job of properly closing network connections from an HTTP server such as Apache and immediately freeing it to handle new connections.
lingerd can do an effective job only if HTTP keep-alives are turned off. Since keep-alives are useful for images, the recommended setup is to serve dynamic content with mod_perl-enabled Apache and lingerd, and static content with plain Apache.
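A minimal sketch of the corresponding keep-alive settings (the numeric values shown are Apache's defaults and are given only for illustration):
# httpd.conf of the mod_perl server running behind lingerd
KeepAlive Off

# httpd.conf of the plain Apache server serving static content
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15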
With a lingerd setup, we don't have the proxy (we don't want to use lingerd on our httpd_docs server, which is also our proxy), so the buffering chain we presented earlier for the proxy setup is much shorter here (see Figure 12-8).
Hence, in this setup it becomes more important to have a big enough kernel send buffer.
With lingerd, a big enough kernel send buffer, and KeepAlives off, the job of spoonfeeding the data to a slow client is done by the OS kernel in the background. As a result, lingerd makes it possible to serve the same load using considerably fewer Apache processes. This translates into a reduced load on the server. It can be used as an alternative to the proxy setups we have seen so far.
For more information about lingerd, see http://www.iagora.com/about/software/lingerd/.
Apache does caching as well. It's relevant to mod_perl only if you produce proper headers, so your scripts' output can be cached. See the Apache documentation for more details on the configuration of this capability.
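For example, a mod_perl handler whose output should be cacheable by the frontend proxy might set the relevant headers as in this minimal sketch (the handler name, the response body, and the one-hour expiry are illustrative assumptions):
package My::CacheFriendly;    # hypothetical handler name

use strict;
use Apache::Constants qw(OK);
use Apache::Util ();

sub handler {
    my $r    = shift;
    my $body = "<html><body>mostly static content</body></html>";

    $r->content_type('text/html');
    # headers that allow mod_proxy to cache the response
    $r->header_out('Content-Length' => length $body);
    $r->header_out('Last-Modified'  => Apache::Util::ht_time($^T));
    $r->header_out('Expires'        => Apache::Util::ht_time(time + 60*60));
    $r->send_http_header;
    $r->print($body);

    return OK;
}
1;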
To enable caching, use the CacheRoot directive, specifying the directory where cache files are to be saved:
CacheRoot /usr/local/apache/cache
Make sure that directory is writable by the user under which httpd is running.
The CacheSize directive sets the desired space usage in kilobytes; for example, 50,000 KB is about 50 MB:
CacheSize 50000
Garbage collection, which enforces the cache size, is scheduled in hours with the CacheGcInterval directive. If it is unspecified, the cache will grow until disk space runs out. The following setting tells mod_proxy to check every hour that the cache doesn't exceed the maximum size:
CacheGcInterval 1
CacheMaxExpire specifies the maximum number of hours for which cached documents will be retained without checking the origin server:
CacheMaxExpire 72
If the origin server did not send an expiry date for a document in the form of an Expires header, the CacheLastModifiedFactor will be used to estimate one, by multiplying that factor by the time elapsed since the document was last modified, as supplied in the Last-Modified header.
CacheLastModifiedFactor 0.1
If the content was modified 10 hours ago, mod_proxy will assume an expiration time of 10 × 0.1 = 1 hour. You should set this according to how often your content is updated.
If neither Last-Modified nor Expires is present, the CacheDefaultExpire directive specifies the number of hours until the document is expired from the cache:
CacheDefaultExpire 24
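Putting the caching directives together, a complete cache section of httpd.conf might look like this minimal sketch, reusing the illustrative values from above:
CacheRoot /usr/local/apache/cache
CacheSize 50000
CacheGcInterval 1
CacheMaxExpire 72
CacheLastModifiedFactor 0.1
CacheDefaultExpire 24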
To build mod_proxy into Apache, just add --enable-module=proxy during the Apache ./configure stage. Since you will probably need mod_rewrite's capability as well, enable it with --enable-module=rewrite.
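For example, a static build enabling both modules might be configured like this (the installation prefix is just an assumption):
panic% ./configure --prefix=/usr/local/apache \
    --enable-module=proxy \
    --enable-module=rewrite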