Logs are the pulse of your web server -- the rhythms produced by the comings and goings of your visitors. In this column I'll give you a gentle introduction to Apache web server logs and their place in monitoring, security, marketing, and feedback.
Before you go running for the hills, I won't be talking about those mathematical logarithms that gave you a headache in high school. Your web server records visits to your web site in the form of logs, a text file (or files) containing entries corresponding to each request (or "hit"). At first glance, logs may look convoluted, but they're actually quite simple. Once you're familiar with the notation, you'll be reading your logs as easily as your daily journal.
Before we dive in, let's get our terminology straight.
Hit
When the Web was young, people measured their web sites' effectiveness
in terms of hits. A hit is a request made of a web server.
A request may correspond to an HTML page, an image, a CGI script, or
any other type of file or interactive content. The important thing to
remember is that each and every request counts as a hit. Therefore,
when a visitor requests a web page containing three embedded images,
the log tallies up four hits.
Hit counts lost their effectiveness as people began to add gratuitous images to their pages in order to inflate their sites' perceived popularity. Hit counts are, however, useful to server administrators as a simplistic traffic or server-utilization indicator.
Page View
A page view, as you probably guessed, is a hit in which only the
page (and not its embedded elements) is counted. Each visit to an HTML
document, whether or not it's crammed full of animated images, sounds,
and Java applets, is counted as a single page view.
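As a rough illustration of the difference between hits and page views, here is a minimal Python sketch. The request paths are made up, and the rule that a "page" is anything ending in .html, .htm, or a slash is an assumption for this example, not something your server enforces.

```python
# Hypothetical request paths: one page plus its three embedded images.
requests = [
    "/mypage.html",
    "/images/logo.gif",
    "/images/photo1.jpg",
    "/images/photo2.jpg",
]

# Assumption: anything ending in .html, .htm, or "/" counts as a page.
PAGE_SUFFIXES = (".html", ".htm", "/")


def count_hits(paths):
    """Every request is a hit, whatever it asks for."""
    return len(paths)


def count_page_views(paths):
    """Only requests for pages count as page views."""
    return sum(1 for path in paths if path.endswith(PAGE_SUFFIXES))


print(count_hits(requests), "hits,", count_page_views(requests), "page view")
```

The one visit produces four hits but only a single page view, which is why the two numbers can tell such different stories.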
Content providers track page view counts to figure out which content is most interesting to their audience. For example, say an article on Internet marketing generated 1,024 page views, whereas another on door-to-door sales generated only 42. One could reasonably guess the site's audience is far more interested in marketing than sales (at least the door-to-door kind).
As another example, let's assume my article is spread across four pages with "next page" links at the bottom of the first three. A particularly telling page view spread would be: Page 1 (456 views), Page 2 (345 views), Page 3 (93 views), and Page 4 (12 views). I would conclude that my audience, while interested in the topic overall, lost interest in my article somewhere in the second page.
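The drop-off in that example can be put into numbers. The sketch below, using the page view counts from the paragraph above, works out what share of each page's views carried over to the next page. It treats each page view as one reader, which is only an approximation.

```python
# Page view counts for Pages 1 through 4, taken from the example above.
views = [456, 345, 93, 12]


def retention(counts):
    """Ratio of each page's views to the views of the page before it."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]


for page, rate in enumerate(retention(views), start=1):
    print(f"Page {page} -> Page {page + 1}: {rate:.0%} continued")
```

Roughly three quarters of the views carry over from Page 1 to Page 2, but only about a quarter carry over from Page 2 to Page 3, which is where the audience falls away.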
Marketeers like to use page view counts as popularity indicators. But the assumption that each page view equals a unique person is almost certainly incorrect. For example, 100 page views could either signify 100 people visiting the page once, or one person visiting the page 100 times.
Unique Host vs. Unique Visitor
Your web logs, with a little massaging, can tell you the number of
unique hosts (or computers) that have paid you a visit. This provides a
smidge more information than straight page views, but there's a problem
here, too.
A visit from a unique host doesn't necessarily equal a visit from a unique visitor. Perhaps the host in question is a computer sitting in a public library; through the course of a day, several users of that computer may visit the same site or even the same page (think Yahoo).
Then there's the issue of the dynamic host. When you dial into your Internet service provider (ISP) via modem, your computer's unique identifier (IP address) is, in all probability, assigned dynamically. If you hang up and dial in again, there's no guarantee that you'll receive the same identifier. So, what looks like a unique host in your log file may actually be several visitors who just happen to have been allocated the same IP address at different times.
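The "little massaging" needed to count unique hosts might look something like the following Python sketch. It assumes the host is the first whitespace-separated field on each line, as it is in the Common Log Format covered later in this column; the sample lines and addresses are invented for illustration.

```python
# Hypothetical access-log lines; the addresses are made up.
log_lines = [
    '10.0.0.1 - - [07/Mar/2000:14:27:12 -0800] "GET /mypage.html HTTP/1.1" 200 10369',
    '10.0.0.2 - - [07/Mar/2000:14:28:03 -0800] "GET /mypage.html HTTP/1.0" 200 10369',
    '10.0.0.1 - - [07/Mar/2000:14:30:45 -0800] "GET /other.html HTTP/1.1" 200 512',
]


def unique_hosts(lines):
    """Collect the first field (the host) of every non-blank line."""
    return {line.split(None, 1)[0] for line in lines if line.strip()}


hosts = unique_hosts(log_lines)
print(len(hosts), "unique hosts")
```

Keep in mind the caveats above: two unique hosts here may be one visitor, or a dozen.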
The bottom line is this: hits, page views, and host visits only give you a general picture of your web site's visitors and traffic patterns. The generally agreed upon way to properly tag and track a unique user is to use cookies (or "magic cookies"), snippets of identifying information that are sent right along with the user's request and server's response.
For more information about cookies, visit the Resources section at the end of this article.
Impression
An impression is almost the same thing as a page view -- the
difference lies in what's being viewed. While a visit to a web page
is generally referred to as a page view, an impression usually refers
to viewing advertisements such as banners, animated mini-commercials,
buttons, and the like.
Let's see what's lurking inside that log.
For the purposes of this look at a typical set of logs, I'm assuming your Apache server has been configured to use Common Log Format (CLF), the default in a fresh Apache installation. Your httpd.conf file should contain the following configuration directive:
CustomLog logs/access_log common
Look at your access log, the location of which will depend upon your layout preferences and installation method. The Apache 1.3.9 RPM installation under Red Hat 6.1 places logs in an /etc/httpd/logs directory. The source and binary installs typically use /usr/local/apache/logs/access_log. The default filename under Windows is access.log.
Let's zoom in on one fairly representative line in a log:
123.45.678.90 - - [07/Mar/2000:14:27:12 -0800] "GET /mypage.html HTTP/1.1" 200 10369
123.45.678.90 | The visitor's IP address. If you particularly need the visitor's host name, read the Apache documentation on the HostNameLookups directive.
- - | The first of the two dashes is a placeholder for something called ident, a less trustworthy form of client identification. That's about all I'll say on this; for further information, see Apache's IdentityCheck directive. The second dash is a placeholder for the user name supplied by a visitor if required to log in to gain access to a password-protected section of the web site. Say, for example, I restricted access to a
[07/Mar/2000:14:27:12 -0800] | The date, time, and time-zone.
GET /mypage.html | The visitor's request, in this case the You'll often see requests consisting only of a slash,
HTTP/1.1 | The browser's request protocol, in this case HTTP, version 1.1. An older, yet still very common protocol, is HTTP 1.0.
200 | An HTTP status code is returned as part of the response to the visitor's browser.
10369 | The number of bytes returned to the visitor, excluding headers (status codes and the like). In the case of a 304 Not Modified status (see above), this value is the usual
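To see how those fields line up in practice, here is a minimal Python sketch that pulls a Common Log Format line apart with a regular expression. The pattern and the parse_clf name are my own for this example rather than anything shipped with Apache, and the pattern is deliberately loose: it does not check that the host is a well-formed address.

```python
import re

# One named group per field described above.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_clf(line):
    """Return a dict of fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A dash in the bytes field is treated as zero bytes in this sketch.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


line = ('123.45.678.90 - - [07/Mar/2000:14:27:12 -0800] '
        '"GET /mypage.html HTTP/1.1" 200 10369')
entry = parse_clf(line)
print(entry["host"], entry["request"], entry["status"], entry["bytes"])
```

Returning None for lines that do not match lets the caller decide whether to skip or report them.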
Logging in Apache (version 1.2 and later) is handled by the Apache module, mod_log_config, which enables you to customize how your logs look and work. Your httpd.conf file contains some popular log formats to get you started:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent
Each log format starts out with the LogFormat directive, followed by a string of tokens that describe how each line of the log file should look, and ending with a nickname given to the format. See the mod_log_config documentation for a comprehensive list of tokens and their meanings. How you want your logs displayed and into how many files you want them sorted is up to you. Some site authors separate log files into referrer and agent logs. I prefer to use the "combined" log format and keep everything in one place.
Let's say I wish to use "common" log format, but also want to keep track of who is linking to my site. I could just use "combined" format, but I don't really care what type of browser (agent) my visitor is using. Instead, I'll create a new LogFormat directive like so:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\"" commonish
Now that I've defined my preferred log format, I need to tell Apache to use this format. Using my "commonish" log format above:
CustomLog logs/commonish_log commonish
where logs/commonish_log is the path to my log file relative to my ServerRoot.
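Once the "commonish" log is being written, finding out who links to the site is a matter of tallying the last field. The following Python sketch is one way to do that; the pattern, the count_referers name, and the sample lines (including the example.com URL) are all invented for illustration. It skips entries whose referrer is a lone dash, which is what Apache writes when the browser sent no Referer header.

```python
import re
from collections import Counter

# Common Log Format fields followed by one quoted Referer field.
COMMONISH_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)


def count_referers(lines):
    """Tally referring URLs, ignoring requests with no referrer ("-")."""
    counts = Counter()
    for line in lines:
        match = COMMONISH_PATTERN.match(line)
        if match and match.group("referer") != "-":
            counts[match.group("referer")] += 1
    return counts


# Hypothetical log lines in the "commonish" format.
sample = [
    '10.0.0.1 - - [07/Mar/2000:14:27:12 -0800] "GET /mypage.html HTTP/1.1" '
    '200 10369 "http://www.example.com/links.html"',
    '10.0.0.2 - - [07/Mar/2000:14:29:40 -0800] "GET /mypage.html HTTP/1.1" '
    '200 10369 "-"',
    '10.0.0.3 - - [07/Mar/2000:14:31:05 -0800] "GET /mypage.html HTTP/1.0" '
    '200 10369 "http://www.example.com/links.html"',
]

for referer, count in count_referers(sample).most_common():
    print(count, referer)
```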
You can actually skip the LogFormat directive and include your preferred log format string in place of the nickname in your CustomLog directive -- it's up to you.
We've only just scratched the surface of log customization. For much more, be sure to read the detailed mod_log_config documentation.
In addition to access logs, Apache notes unusual server activity in an error log. In your httpd.conf file, the ErrorLog and LogLevel directives pertain to the error log. They should look something like this:
ErrorLog logs/error_log
LogLevel warn
The first line tells Apache where to log errors. The second line sets the threshold for what types of errors to log. The default warn level should be just fine; for a list of log levels, consult the Apache LogLevel documentation.
The contents of the Apache error log are pretty clear. For example, someone requesting an HTML document, nonexistent.html, that does not exist on the server, generates the following error log entry:
[Tue Mar 07 09:59:29 2000] [error] [client 123.45.678.90] File does not exist: /path/to/htdocs/nonexistent.html
Restarting your Apache server generates:
[Tue Mar 07 09:52:40 2000] [notice] SIGHUP received. Attempting to restart
[Tue Mar 07 09:52:42 2000] [notice] Apache/1.3.11 (Unix) configured -- resuming normal operations
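Error log entries like the ones above are regular enough to pick apart with a short script. Here is a minimal Python sketch, assuming each entry sits on a single line in the bracketed layout shown; the parse_error_line name and the pattern are my own, and the client field is optional because entries such as restart notices do not have one.

```python
import re

# [timestamp] [level] optional [client address] then the message text.
ERROR_PATTERN = re.compile(
    r'\[(?P<time>[^\]]+)\] \[(?P<level>\w+)\] '
    r'(?:\[client (?P<client>[^\]]+)\] )?'
    r'(?P<message>.*)'
)


def parse_error_line(line):
    """Return a dict of time, level, client, and message, or None."""
    match = ERROR_PATTERN.match(line)
    return match.groupdict() if match else None


missing = parse_error_line(
    '[Tue Mar 07 09:59:29 2000] [error] [client 123.45.678.90] '
    'File does not exist: /path/to/htdocs/nonexistent.html'
)
restart = parse_error_line(
    '[Tue Mar 07 09:52:40 2000] [notice] SIGHUP received. '
    'Attempting to restart'
)
print(missing["level"], missing["client"], missing["message"])
print(restart["level"], restart["client"], restart["message"])
```

From here it is a small step to, say, count the "File does not exist" messages per path to find broken links.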
The error log is a very useful tool for:
LogLevel to suit your particular needs.
The following is a list of starting points from which to explore further some of the topics covered in this article.
Apache and mod_perl, RPM-Style.
oreillynet.com Copyright © 2003 O'Reilly & Associates, Inc.