O'Reilly Network: Log Rhythms (O'Reilly Network)

    Published on the O'Reilly Network (http://www.oreillynet.com/)
   http://www.oreillynet.com/pub/a/apache/2000/03/10/log_rhythms.html

Log Rhythms

by Rael Dornfest
03/10/2000

Logs are the pulse of your web server -- the rhythms produced by the comings and goings of your visitors. In this column I'll give you a gentle introduction to Apache web server logs and their place in monitoring, security, marketing, and feedback.

Before you go running for the hills: no, I won't be talking about those mathematical logarithms that gave you a headache in high school. Your web server records visits to your web site in logs: one or more text files containing an entry for each request (or "hit"). At first glance, logs may look convoluted, but they're actually quite simple. Once you're familiar with the notation, you'll be reading your logs as easily as your daily journal.

"One Hit Wonder" or Lasting Impression?

Before we dive in, let's get our terminology straight.

Hit
When the Web was young, people measured their web sites' effectiveness in terms of hits. A hit is a request made of a web server. A request may correspond to an HTML page, an image, a CGI script, or any other type of file or interactive content. The important thing to remember is that each and every request counts as a hit. Therefore, when a visitor requests a web page containing three embedded images, the log tallies up four hits.

Hit counts lost their effectiveness as people began to add gratuitous images to their pages in order to inflate their sites' perceived popularity. Hit counts are, however, useful to server administrators as a simplistic traffic or server-utilization indicator.

Page View
A page view, as you probably guessed, is a hit in which only the page (and not its embedded elements) is counted. Each visit to an HTML document, whether or not it's crammed full of animated images, sounds, and Java applets, is counted as a single page view.
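To make the difference between hits and page views concrete, here is a minimal Python sketch. It assumes that a "page" is any request ending in .html, .htm, or a slash; that rule and the function names are my own illustrative choices, not anything Apache defines, so adjust them to suit your site.

```python
import os

# Assumption: only these extensions count as pages on this hypothetical site.
PAGE_EXTENSIONS = {".html", ".htm"}

def is_page_view(path):
    """Return True if the requested path looks like a page, not an embedded element."""
    path = path.split("?", 1)[0]      # ignore any query string
    if path.endswith("/"):            # a directory request returns the default document
        return True
    extension = os.path.splitext(path)[1].lower()
    return extension in PAGE_EXTENSIONS

def count_hits_and_page_views(paths):
    """Every request is a hit; only page requests are page views."""
    hits = len(paths)
    page_views = sum(1 for path in paths if is_page_view(path))
    return hits, page_views

# One page with three embedded images: four hits, one page view.
requests = ["/mypage.html", "/images/a.gif", "/images/b.gif", "/images/c.gif"]
print(count_hits_and_page_views(requests))  # (4, 1)
```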

Content providers track page view counts to figure out which content is most interesting to their audience. For example, say an article on Internet marketing generated 1,024 page views, whereas another on door-to-door sales generated only 42. One could reasonably guess the site's audience is far more interested in marketing than sales (at least the door-to-door kind).

As another example, let's assume my article is spread across four pages with "next page" links at the bottom of the first three. A particularly telling page view spread would be: Page 1 (456 views), Page 2 (345 views), Page 3 (93 views), and Page 4 (12 views). I would conclude that my audience, while interested in the topic overall, lost interest in my article somewhere in the second page.
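The drop-off is easier to see as a ratio between consecutive pages. Here's a small sketch using the figures above; the function name is mine, and since page views are not unique readers, treat the percentages as rough indicators only.

```python
def retention(views):
    """For each page, the fraction of its views that carried on to the next page."""
    return [following / current for current, following in zip(views, views[1:])]

views = [456, 345, 93, 12]  # Pages 1 through 4, from the example above

for page, kept in enumerate(retention(views), start=1):
    print(f"Page {page} -> Page {page + 1}: {kept:.0%} continued")
# Page 1 -> Page 2: 76% continued
# Page 2 -> Page 3: 27% continued
# Page 3 -> Page 4: 13% continued
```

Roughly three quarters of the page 1 views carry on to page 2, but only about a quarter of the page 2 views carry on to page 3, which is where the audience drifts away.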

Marketeers like to use page view counts as popularity indicators. But the assumption that each page view equals a unique person is almost certainly incorrect. For example, 100 page views could either signify 100 people visiting the page once, or one person visiting the page 100 times.

Unique Host vs. Unique Visitor
Your web logs, with a little massaging, can tell you the number of unique hosts (or computers) that have paid you a visit. This provides a smidge more information than straight page views, but there's a problem here, too.

A visit from a unique host doesn't necessarily equal a visit from a unique visitor. Perhaps the host in question is a computer sitting in a public library; through the course of a day, several users of that computer may visit the same site or even the same page (think Yahoo).

Then there's the issue of the dynamic host. When you dial into your Internet service provider (ISP) via modem, your computer's unique identifier (IP address) is, in all probability, assigned dynamically. If you hang up and dial in again, there's no guarantee that you'll receive the same identifier. So, what looks like a unique host in your log file may actually be several visitors who just happen to have been allocated the same IP address at different times.
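The "little massaging" needed to count unique hosts can be sketched in a few lines of Python. This assumes each log line begins with the requesting host, as it does in Apache's Common Log Format; the sample lines and the second address are made up for illustration. Keep the caveats above in mind: the result counts addresses, not people.

```python
def unique_hosts(lines):
    """Collect the distinct first fields (host or IP address) of the log lines."""
    hosts = set()
    for line in lines:
        fields = line.split(None, 1)  # split off the first whitespace-delimited field
        if fields:                    # skip blank lines
            hosts.add(fields[0])
    return hosts

# Hypothetical sample: two requests from one host, one from another.
sample = [
    '123.45.678.90 - - [07/Mar/2000:14:27:12 -0800] "GET /mypage.html HTTP/1.1" 200 10369',
    '123.45.678.90 - - [07/Mar/2000:14:27:13 -0800] "GET /images/a.gif HTTP/1.1" 200 512',
    '98.76.543.21 - - [07/Mar/2000:14:30:02 -0800] "GET /mypage.html HTTP/1.1" 200 10369',
]
print(len(unique_hosts(sample)))  # 2
```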

The bottom line is this: hits, page views, and host visits only give you a general picture of your web site's visitors and traffic patterns. The generally agreed upon way to properly tag and track a unique user is to use cookies (or "magic cookies"), snippets of identifying information that are sent right along with the user's request and server's response.

For more information about cookies, visit the Resources section at the end of this article.

Impression
An impression is almost the same thing as a page view -- the difference lies in what's being viewed. While a visit to a web page is generally referred to as a page view, an impression usually refers to viewing advertisements such as banners, animated mini-commercials, buttons, and the like.

The Access Log

Let's see what's lurking inside that log. For the purposes of this look at a typical set of logs, I'm assuming your Apache server has been configured to use Common Log Format (CLF), the default in a fresh Apache installation. Your httpd.conf file should contain the following configuration directive:

CustomLog logs/access_log common

Look at your access log, the location of which will depend upon your layout preferences and installation method. The Apache 1.3.9 RPM installation under Red Hat 6.1 places logs in an /etc/httpd/logs directory. The source and binary installs typically use /usr/local/apache/logs/access_log. The default filename under Windows is access.log.

Let's zoom in on one fairly representative line in a log:

123.45.678.90 - - [07/Mar/2000:14:27:12 -0800] "GET /mypage.html HTTP/1.1" 200 10369

`123.45.678.90`	The visitor's IP address. If you particularly need the visitor's host name, read the Apache documentation on the HostNameLookups directive.
`- -`	The first of the two dashes is a placeholder for something called ident, a less trustworthy form of client identification. That's about all I'll say on this; for further information, see Apache's IdentityCheck directive. The second dash is a placeholder for the user name supplied by a visitor if required to log in to gain access to a password-protected section of the web site. Say, for example, I restricted access to a `private` directory on my server to only myself. Upon visiting `http://www.memyselfandi.net/private`, I'd have to log in (say, as the user "me") to gain access to that directory's contents. Thereafter, all my requests for items in that directory are logged, replacing the dash with `me`.
`[07/Mar/2000:14:27:12 -0800]`	The date, time, and time-zone.
`GET /mypage.html`	The visitor's request, in this case the `mypage.html` document in the web server's document root. You'll often see requests consisting only of a slash, `GET /`, or composed of a directory path and ending in only a slash, `GET /some/path/`. This denotes a request for the default document within the server's document root or along some directory path. So, if your default DirectoryIndex is `index.html`, every request for `/` results in the return of that directory's `index.html` document to the visitor's browser. If no DirectoryIndex document exists in the requested directory, the browser will display either a listing of the files in that directory or a "Forbidden" message, depending on whether the Indexes option is enabled for that directory (see the Options directive); IndexOptions and FancyIndexing only affect how such a listing looks.
`HTTP/1.1`	The browser's request protocol, in this case HTTP, version 1.1. An older, yet still very common, version is HTTP/1.0.
`200`	An HTTP status code is returned as part of the response to the visitor's browser. `200` signifies "OK" -- request fulfilled. A common error you might have come across in your Web travels is "404 Not Found," indicating that the request does not match anything on the server. Also, a code of "304 Not Modified" says that the content has not changed since it was last requested. In other words, you've visited before and already have the latest copy of this content in your browser's cache, so the content is not resent for efficiency's sake.
`10369`	The number of bytes returned to the visitor, excluding headers (status codes and the like). In the case of a 304 Not Modified status (see above), this value is the usual `-` placeholder.
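The field-by-field breakdown above maps neatly onto a regular expression. Here's a minimal Python sketch that parses one Common Log Format line into a dictionary; the pattern and the field names (host, ident, user, and so on) are my own labels, and treating a `-` byte count as zero is a convenience I've chosen, not something Apache prescribes.

```python
import re

# One named group per field described in the table above.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_clf(line):
    """Return a dict of fields for a Common Log Format line, or None if it doesn't match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" means no body was sent (for example, 304 Not Modified); count it as 0 bytes.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

line = '123.45.678.90 - - [07/Mar/2000:14:27:12 -0800] "GET /mypage.html HTTP/1.1" 200 10369'
entry = parse_clf(line)
print(entry["host"], entry["request"], entry["status"], entry["bytes"])
```

The host is matched with `\S+` rather than a strict IP pattern, so the sketch also accepts host names (if you've turned on HostNameLookups) and placeholder addresses like the one in the example.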

Logging in Apache (version 1.2 and later) is handled by the Apache module, mod_log_config, which enables you to customize how your logs look and work. Your httpd.conf file contains some popular log formats to get you started:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent

Each log format starts out with the LogFormat directive, followed by a string of tokens that describe how each line of the log file should look, and ending with a nickname given to the format. The mod_log_config documentation has a comprehensive list of tokens and their meanings. How you want your logs displayed and into how many files you want them sorted is up to you. Some site authors separate log files into referrer and agent logs. I prefer to use the "combined" log format and keep everything in one place.
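Because a log format is just a string of tokens, you can turn one into a parser mechanically. The following Python sketch translates a LogFormat string into a regular expression. It is an illustration under narrow assumptions: it knows only the handful of tokens used in the "combined" format above, the group names are my own, and any token it doesn't recognize is treated as literal text.

```python
import re

# Assumed mapping from a few LogFormat tokens to regex fragments (not exhaustive).
TOKEN_PATTERNS = {
    "%h": r'(?P<host>\S+)',
    "%l": r'(?P<ident>\S+)',
    "%u": r'(?P<user>\S+)',
    "%t": r'\[(?P<time>[^\]]+)\]',
    "%r": r'(?P<request>[^"]*)',
    "%>s": r'(?P<status>\d{3})',
    "%b": r'(?P<bytes>\d+|-)',
    "%{Referer}i": r'(?P<referer>[^"]*)',
    "%{User-Agent}i": r'(?P<agent>[^"]*)',
}
TOKEN_RE = re.compile(r'(%\{[^}]+\}i|%>s|%[hlutrb])')

def format_to_regex(log_format):
    """Build a compiled regex from a LogFormat string as written in httpd.conf."""
    log_format = log_format.replace('\\"', '"')   # undo httpd.conf quote escaping
    pieces = []
    for part in TOKEN_RE.split(log_format):       # tokens and literal text, interleaved
        pieces.append(TOKEN_PATTERNS.get(part, re.escape(part)))
    return re.compile("".join(pieces) + "$")

combined = r'%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'
pattern = format_to_regex(combined)

# Hypothetical log line; the referrer and agent values are invented for the example.
line = ('123.45.678.90 - - [07/Mar/2000:14:27:12 -0800] "GET /mypage.html HTTP/1.1" '
        '200 10369 "http://www.example.com/links.html" "Mozilla/4.7 [en] (X11; U; Linux)"')
match = pattern.match(line)
print(match.group("referer"), "|", match.group("agent"))
```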

Let's say I wish to use "common" log format, but also want to keep track of who is linking to my site. I could just use "combined" format, but I don't really care what type of browser (agent) my visitor is using. Instead, I'll create a new LogFormat directive like so:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\"" commonish

Now that I've defined my preferred log format, I need to tell Apache to use this format. Using my "commonish" log format above:

CustomLog logs/commonish_log commonish

where logs/commonish_log is the path to my log file relative to my ServerRoot. You can actually skip the LogFormat directive and include your preferred log format string in place of the nickname in your CustomLog directive -- it's up to you.
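With referrers in the log, finding out who links to you is a counting exercise. Here's a rough Python sketch that tallies the referrer field of lines written in the "commonish" format. The sample lines and the example.com and example.org URLs are invented, and skipping `-` (no referrer supplied) is my own choice.

```python
import re
from collections import Counter

# Matches a "commonish" line and captures only the trailing quoted referrer.
COMMONISH = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?:\d+|-) "(?P<referer>[^"]*)"'
)

def count_referers(lines):
    """Tally referring URLs, ignoring requests that arrived without one."""
    counts = Counter()
    for line in lines:
        match = COMMONISH.match(line)
        if match and match.group("referer") not in ("", "-"):
            counts[match.group("referer")] += 1
    return counts

# Hypothetical sample lines in the "commonish" format.
sample = [
    '123.45.678.90 - - [07/Mar/2000:14:27:12 -0800] "GET /mypage.html HTTP/1.1" 200 10369 "http://www.example.com/links.html"',
    '98.76.543.21 - - [07/Mar/2000:14:30:02 -0800] "GET /mypage.html HTTP/1.1" 200 10369 "http://www.example.com/links.html"',
    '98.76.543.21 - - [07/Mar/2000:14:31:40 -0800] "GET / HTTP/1.1" 200 2048 "http://www.example.org/"',
    '123.45.678.90 - - [07/Mar/2000:14:35:00 -0800] "GET / HTTP/1.1" 200 2048 "-"',
]
for url, count in count_referers(sample).most_common():
    print(count, url)
```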

We've only just scratched the surface of log customization. For much more, be sure to read the detailed mod_log_config documentation.

Bump on a Log: The Error Log

In addition to access logs, Apache notes unusual server activity in an error log. In your httpd.conf file, the ErrorLog and LogLevel directives pertain to the error log. They should look something like this:

ErrorLog logs/error_log
LogLevel warn

The first line tells Apache where to log errors. The second line sets the threshold for what types of errors to log. The default warn level should be just fine; for a list of log levels, consult the Apache LogLevel documentation.
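The idea of a threshold can be sketched in Python. The level names below are Apache's, listed from most to least severe; the function itself is only my illustration of the concept, not Apache's implementation.

```python
# Apache's log levels, from most severe to least severe.
LEVELS = ["emerg", "alert", "crit", "error", "warn", "notice", "info", "debug"]

def passes_threshold(level, threshold="warn"):
    """Return True if a message at `level` is at least as severe as `threshold`."""
    return LEVELS.index(level) <= LEVELS.index(threshold)

for level in LEVELS:
    print(f"{level:7} logged at LogLevel warn: {passes_threshold(level)}")
```

One wrinkle this sketch leaves out: according to the Apache LogLevel documentation, notice-level messages are always written when the error log is a regular file, so you may see them even with the threshold set to warn.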

The contents of the Apache error log are pretty clear. For example, someone requesting an HTML document, nonexistent.html, that does not exist on the server, generates the following error log entry:

[Tue Mar 07 09:59:29 2000] [error] [client 123.45.678.90] File does not exist: /path/to/htdocs/nonexistent.html

Restarting your Apache server generates:

[Tue Mar 07 09:52:40 2000] [notice] SIGHUP received.  Attempting to restart
[Tue Mar 07 09:52:42 2000] [notice] Apache/1.3.11 (Unix) configured -- resuming normal operations
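Error log entries like the ones above follow a regular shape: a bracketed timestamp, a bracketed level, an optional bracketed client, and a free-form message. Here's a minimal Python sketch that splits an entry into those parts, assuming each entry sits on a single line; the pattern is inferred from the examples in this article, so entries that look different won't match.

```python
import re

ERROR_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] '
    r'\[(?P<level>\w+)\] '
    r'(?:\[client (?P<client>[^\]]+)\] )?'   # only present for request-related entries
    r'(?P<message>.*)'
)

def parse_error_line(line):
    """Return a dict with time, level, client (or None), and message; None if no match."""
    match = ERROR_LINE.match(line)
    return match.groupdict() if match else None

missing = parse_error_line(
    '[Tue Mar 07 09:59:29 2000] [error] [client 123.45.678.90] '
    'File does not exist: /path/to/htdocs/nonexistent.html'
)
restart = parse_error_line(
    '[Tue Mar 07 09:52:40 2000] [notice] SIGHUP received.  Attempting to restart'
)
print(missing["level"], missing["client"], missing["message"])
print(restart["level"], restart["client"], restart["message"])
```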

The error log is a very useful tool for:

Monitoring
Keep tabs on your server's activity and status. The key is tuning the LogLevel to suit your particular needs.
Security
The error logs are often your first indication that something is amiss. A sudden spate of "authentication failures" in password-protected directories may indicate someone trying to see what they're not allowed to.
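Spotting a "sudden spate" by eye gets tedious, so here's a rough Python sketch that counts error entries per client and flags the busy ones. The sample lines reuse the "File does not exist" message shown earlier, with made-up paths and a made-up second address. To watch for authentication failures, pass whatever phrase your own server writes for them as `phrase`; I'm not assuming a particular wording here, and the threshold of three is arbitrary.

```python
import re
from collections import Counter

CLIENT_ERROR = re.compile(r'\[error\] \[client (?P<client>[^\]]+)\] (?P<message>.*)')

def errors_per_client(lines, phrase=None):
    """Count error entries per client, optionally only those whose message contains phrase."""
    counts = Counter()
    for line in lines:
        match = CLIENT_ERROR.search(line)
        if match is None:
            continue
        if phrase is not None and phrase.lower() not in match.group("message").lower():
            continue
        counts[match.group("client")] += 1
    return counts

def suspicious_clients(counts, threshold):
    """Return the clients with at least `threshold` matching entries, sorted."""
    return sorted(client for client, n in counts.items() if n >= threshold)

# Hypothetical sample: one client probing for files that don't exist.
sample = [
    '[Tue Mar 07 09:59:29 2000] [error] [client 123.45.678.90] File does not exist: /path/to/htdocs/a.html',
    '[Tue Mar 07 09:59:30 2000] [error] [client 123.45.678.90] File does not exist: /path/to/htdocs/b.html',
    '[Tue Mar 07 09:59:31 2000] [error] [client 123.45.678.90] File does not exist: /path/to/htdocs/c.html',
    '[Tue Mar 07 10:02:10 2000] [error] [client 98.76.543.21] File does not exist: /path/to/htdocs/nonexistent.html',
    '[Tue Mar 07 09:52:40 2000] [notice] SIGHUP received.  Attempting to restart',
]
counts = errors_per_client(sample, phrase="File does not exist")
print(suspicious_clients(counts, threshold=3))  # ['123.45.678.90']
```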

Resources

The following is a list of starting points from which to explore further some of the topics covered in this article.

Apache Logging Modules
- mod_log_config (default)
- mod_log_agent (deprecated)
- mod_log_common (deprecated)
- mod_log_referer (deprecated)
Apache Log Configuration Directives
- CustomLog
- ErrorLog
- LogFormat
- LogLevel
- RefererLog (deprecated)
- ScriptLog
- ScriptLogBuffer
- ScriptLogLength
- TransferLog
User Tracking and Cookies
- Cookie Central
- mod_usertrack
Log Analysis Tools
- Yahoo Log Analysis Tools -- a huge tool drawer
- Analog -- a personal favorite
- The Webalizer -- another popular analyzer

Tune in Next Time...

Apache and mod_perl, RPM-Style.