Stopping Spam with SpamAssassin
I receive a lot of spam; an absolutely massive bucket load of spam. I received more than 100 pieces of spam in the first three days of this month. I receive so much spam that Hormel Foods sends trucks to take it away. And I'm convinced that things are getting worse. We're all being bombarded with junk mail more than ever these days.
Well, a couple of days ago, I reached my breaking point, and decided that the simple mail filtering I had in place up until now just wasn't up to the job. It was time to call in an assassin.
SpamAssassin is a rule-based spam identification tool. It's written in Perl, and there are several ways of using it: you can call a client program, spamassassin, and have it determine whether a given message is likely to be spam; you can do essentially the same thing but use a client/server approach, so that your client isn't loading and parsing the rules every time mail arrives; or, finally, you can use a Perl module interface to filter spam from a Perl program.
SpamAssassin is extremely configurable; you can select which rules you want to use, change the way the rules contribute to a piece of mail's "spam score," and add your own rules. We'll look at some of these features later in the article. First, how do we get SpamAssassin installed and start using it?
If you're using Debian Linux or one of the BSDs, then this couldn't be easier: just install the appropriate package using apt or the ports tree, respectively. (The BSD port is called p5-Mail-SpamAssassin.)
Those less fortunate will have to download the latest version of SpamAssassin, and install it themselves.
SpamAssassin uses a variety of techniques to test whether an e-mail is spam, ranging from simple textual checks on the headers or body and the detection of missing or misleading headers, to network-based checks and an interesting distributed system called Vipul's Razor.
Vipul's Razor takes advantage of the fact that spam is, by its nature, distributed in bulk. Hence, a lot of the spam that you see, I'm also going to see at some point. If there were a big clearing-house where you could report spam and I could see if my incoming mail matches what you've already reported, then I could have a guaranteed way of determining whether a given mail is spam. Vipul's Razor is that clearing-house.
Why is it a Razor? Because it's a collaborative system, its strength is directly derived from the quality of its database, which comes back to the way it's used by the likes of you and me. If end-users report lots of real spam, the Razor gets better; if the database gets "poisoned" by lots of false or misleading reports, then the efficiency of the whole system drops.
Just like any other spam detection mechanism, Razor isn't perfect. There are two points particularly worth noting. First, while it tries to completely avoid false positives (saying something's spam when it isn't) by requiring that spam be reported, it doesn't do anything about false negatives (saying something's not spam when it is) because it only knows about the mail in its database.
Second, spammers, like all other primitive organisms, are constantly evolving. Vipul's Razor only works for spam that is delivered in bulk without modification. Spam that is "personalized" by the addition of random spaces, letters or the name of the recipient, will produce a different signature that won't match similar spam messages in the Razor database.
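To see why personalization defeats exact matching, here is a minimal sketch. It assumes a plain SHA-1 digest of the message body as the "signature"; that is a stand-in chosen for illustration, not Razor's actual signature scheme, and the %database hash and sample messages are hypothetical.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Illustrative only: a plain SHA-1 of the message body stands in for a
# "signature". This is NOT Razor's actual signature scheme.
sub body_signature {
    my ($body) = @_;
    return sha1_hex($body);
}

# Hypothetical clearing-house database: signatures of reported spam.
my $reported = "Dear friend, earn money fast from home!\n";
my %database = ( body_signature($reported) => 1 );

# An unmodified bulk copy matches; a "personalized" copy does not.
my $bulk_copy    = "Dear friend, earn money fast from home!\n";
my $personalized = "Dear Simon, earn money fast from home!\n";

for my $msg ($bulk_copy, $personalized) {
    my $verdict = $database{ body_signature($msg) } ? "known spam" : "no match";
    print "$verdict: $msg";
}
```

Changing a single word is enough to produce a different digest, so the personalized copy slips past a lookup that relies on exact signatures.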
Nevertheless, the Razor is an excellent addition to the spam fighter's arsenal, since when it marks something as spam, you can be almost positive it's correct. And just like SpamAssassin, it's all pure Perl. Mail::Audit has long supported a Razor plugin, but now we can move to calling Razor as part of a more comprehensive mail filtering system based on SpamAssassin and Mail::Audit.
Installing Vipul's Razor is similar to installing SpamAssassin. Debian and BSD users have packages called "razor" and "razor-clients," respectively; and the rest of the world can download and install from the home page. SpamAssassin will detect whether Razor is available and, by default, use it if so.
So this is the part you've all been waiting for. How do we use these things to trap spam? For those of you who aren't familiar with Mail::Audit, the idea is simple: just like with procmail, you write recipes that determine what happens to your mail. However, in the case of Mail::Audit, you specify the recipe in Perl. For instance, here's a recipe to move all mail sent to perl5-porters@perl.org to another folder:
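use Mail::Audit;
my $mail = Mail::Audit->new();
# File anything addressed to the list in the "p5p" folder.
if ($mail->to =~ /perl5-porters\@perl\.org/) {
    $mail->accept("p5p");
}
# Everything else is delivered to the default inbox.
$mail->accept();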
For more details on how to construct mail filters with Mail::Audit, see my previous article.
Plugging SpamAssassin into your filters couldn't be simpler. First of all, you absolutely need the latest version of Mail::Audit, version 2.1 from CPAN. Nothing earlier will do! Now write a filter like this:
use Mail::Audit;
use Mail::SpamAssassin;
my $mail = Mail::Audit->new();
# ... the rest of your rules here ...
my $spamtest = Mail::SpamAssassin->new();
my $status = $spamtest->check($mail);
if ($status->is_spam()) {
    # Add the explanatory headers, tag the subject, and file it as spam.
    $status->rewrite_mail();
    $mail->accept("spam");
}
$mail->accept();
As you might be able to guess, the important things here are the calls to check and is_spam. check produces a "status object" that we can query and use to manipulate the e-mail. is_spam tells us whether the mail has exceeded the number of "spam points" required to flag an e-mail as spam.
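The idea of "spam points" can be sketched in a few lines of plain Perl. The rule names below are borrowed from the example header in this article, but the per-rule scores are hypothetical, and spam_verdict is an illustrative helper rather than part of SpamAssassin. Treating a total equal to the threshold as spam is also an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical per-rule scores, for illustration only; SpamAssassin's
# real scores come from its own rule and configuration files.
my %score = (
    SUBJ_HAS_Q_MARK     => 1.1,
    REPLY_TO_EMPTY      => 2.5,
    SUBJ_ENDS_IN_Q_MARK => 2.5,
);

# Sum the scores of the rules that fired and compare with the threshold.
# Unknown rule names contribute nothing.
sub spam_verdict {
    my ($required, @fired) = @_;
    my $hits = 0;
    $hits += $score{$_} // 0 for @fired;
    return ($hits, $hits >= $required ? 1 : 0);
}

my ($hits, $is_spam) = spam_verdict(5.0, keys %score);
printf "hits=%.1f required=%.1f spam=%s\n", $hits, 5.0, $is_spam ? "Yes" : "No";
```

No single rule here is enough to condemn a message; it is the combination that pushes the total over the threshold, which is why adjusting individual scores changes how aggressive the filter is.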
The rewrite_mail method adds some headers and rewrites the subject line to include the distinctive string "*****SPAM******". The additional headers explain why the e-mail was flagged as spam. For instance:
X-Spam-Status: Yes, hits=6.1 required=5.0
    tests=SUBJ_HAS_Q_MARK,REPLY_TO_EMPTY,SUBJ_ENDS_IN_Q_MARK version=2.1
This message had a question mark in the subject, an empty reply-to, and
the subject ended in a question mark. The mail wasn't actually spam, but
this goes to prove that the technique isn't perfect. Nevertheless, since
installing the spam filter, I've only seen about 10 false positives,
and zero false negatives. I'm happy enough with this solution.
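If you want to review false positives like this one, it can help to pull the fields out of the header programmatically. The following is a rough sketch: parse_spam_status is a hypothetical helper, and its regular expressions are inferred from the single example header shown above rather than from any format specification.

```perl
use strict;
use warnings;

# Parse an X-Spam-Status header value of the form shown in the article.
# The field layout is inferred from that one example, so treat these
# patterns as assumptions rather than a definition of the format.
sub parse_spam_status {
    my ($value) = @_;
    $value =~ s/\s+/ /g;    # unfold continuation lines into one line
    my %info;
    $info{flagged}    = $value =~ /^\s*Yes\b/i ? 1 : 0;
    ($info{hits})     = $value =~ /\bhits=(-?[\d.]+)/;
    ($info{required}) = $value =~ /\brequired=(-?[\d.]+)/;
    my ($tests)       = $value =~ /\btests=(\S*)/;
    $info{tests} = [ split /,/, $tests // '' ];
    return \%info;
}

my $status = parse_spam_status(
    "Yes, hits=6.1 required=5.0\n"
  . "    tests=SUBJ_HAS_Q_MARK,REPLY_TO_EMPTY,SUBJ_ENDS_IN_Q_MARK version=2.1"
);
printf "%s fired\n", $_ for @{ $status->{tests} };
```

Tallying which tests fire most often on your misfiled mail is one way to decide which rules deserve a lower score.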
One important point to remember, however, is where in the course of your filtering you should call SpamAssassin's checks. For instance, you want to do so after your mailing list filtering, because mail sent to mailing lists may have munged headers that might confuse SpamAssassin. However, this means that spam sent to mailing lists might slip through the net. Experiment, and find the best solution for your own e-mail patterns.
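The ordering trade-off can be modeled without any mail libraries at all. In this sketch, messages are plain hashes, looks_spammy is a hypothetical placeholder for the SpamAssassin check, and the addresses and folder names are made up for illustration.

```perl
use strict;
use warnings;

# A stand-in for a filter script: rules run in order and the first match
# decides the folder. looks_spammy() is a hypothetical placeholder for
# the real SpamAssassin check.
sub looks_spammy {
    my ($msg) = @_;
    return $msg->{subject} =~ /\bFREE\b/ ? 1 : 0;
}

my @rules = (
    # Mailing list traffic is filed first, so it never reaches the spam test.
    sub { $_[0]{to} =~ /perl5-porters\@perl\.org/ ? "p5p" : undef },
    # Everything else is checked for spam.
    sub { looks_spammy($_[0]) ? "spam" : undef },
);

# Return the folder chosen by the first matching rule, or the inbox.
sub file_message {
    my ($msg) = @_;
    for my $rule (@rules) {
        my $folder = $rule->($msg);
        return $folder if defined $folder;
    }
    return "inbox";
}

my $list_spam = { to => 'perl5-porters@perl.org', subject => 'FREE offer' };
my $direct    = { to => 'me@example.com',         subject => 'FREE offer' };
print file_message($_), "\n" for $list_spam, $direct;
```

The same spammy subject ends up in the list folder when it arrives via the list and in the spam folder when it arrives directly, which is exactly the gap described above. Swapping the two rules closes that gap at the cost of running the spam check on list mail.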
Of course, there are times when it might not be suitable to use Mail::Audit, or you may not want to. Since SpamAssassin is provided as a command line tool as well as a set of Perl modules, it's easy enough to integrate it into whatever mail filtering solution you use. For instance, here's a procmail recipe that calls out to spamassassin to filter out spam:
:0fw
| spamassassin -P
:0:
* ^X-Spam-Status: Yes
spambox
For the speed-conscious, you can run the spamd daemon and replace calls to spamassassin with spamc; be aware that this is a TCP/IP daemon that you may want to firewall from the rest of the world.
Another approach is to call spamassassin from your mail transport agent, meaning that spam is filtered out before delivery to you is even attempted. There's a Sendmail milter library available that allows you to use SpamAssassin, and similar tricks are available for Exim and other MTAs.
The Mail::SpamAssassin module has many other methods you can use to manipulate e-mail. For instance, if you've identified something as definitely being spam, then you can use
$spamtest->report_as_spam($mail);
to report it to Vipul's Razor. (Take note of this: As we've mentioned
above, the efficiency of the Razor database comes from the fact that
e-mails in it are confirmed as spam by a human. Adding false positives to
the database would degrade its usefulness for everyone. Only submit mail
that you've confirmed personally.)
If you're finding that mail checking is taking too long because SpamAssassin is having to contact the various network-based blacklists and databases, then you can instruct it to only perform "local" checking:
$spamtest = Mail::SpamAssassin->new({local_tests_only => 1});
There is a wealth of other options available. See the Mail::SpamAssassin documentation for more details, and happy assassinating!
Perl.com Compilation Copyright © 1998-2003 O'Reilly & Associates, Inc.