Stopping Spam at the Server: Part III
- The Safety (and Wisdom) of Combined Solutions
- You Know It's Spam, But What Do You Do With It?
- So How Does This Save Users Time?
- Making Quarantines Even Smarter
- Clean Your Own House
In the second article in this series, we looked at the various tests a smart anti-spam solution can make on E-mail contents. Adding this technique to those we discussed in the first article (testing the E-mail envelopes) gives us a wide range of options to choose from.
However, the best approach of all is to combine as many spam identification tricks as possible, maximizing your chances of sorting the spam from the ham so that your users can get on with their lives with minimal concern. And if you live by the mail administrator credo of Lose No Mail, your anti-spam solution should also let users handle their own quarantines and train their own filters.
The Safety (and Wisdom) of Combined Solutions
A combined filtering approach makes sense for a number of reasons. For one thing, spammers are always looking for a clever way to foil the latest spam-blocking techniques. Consider the Bayesian poison pills we discussed in the previous article in this series, or the Distributed Denial of Service (DDoS) attacks that were used not only to shut down a couple of DNSBLs, but also to cripple the anti-spam software that relied on those external sources to make its decisions. When these attacks occurred, some administrators found their E-mail servers unable to deliver mail until their spam-fighting measures were disabled, or until new ones were hastily installed.
A combined approach relies on scoring techniques similar to those used by feature recognizers. Such tools offer a range of supported tests, from envelope to content to the many additional types available. When configuring a combined tool, the administrator tells it which tests to use; these tests are assigned numeric values according to how much each particular test is trusted to accurately spot spam. The more reliable the test is for spotting spam, the higher the score; the more reliably it identifies ham, the lower the score. Once the battery of available tests is finished on a particular message, the tool adds up all the positive and negative values to calculate an overall score.
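To make the scoring idea concrete, here is a minimal Python sketch of a combined scorer. The rule names, predicates, scores, and the `KNOWN_SENDERS` set are all hypothetical placeholders chosen for illustration; they are not taken from any particular product.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical address book used by the ham-indicating test below.
KNOWN_SENDERS = {"colleague@example.com"}

# Each test is (name, predicate, score). All three entries are
# illustrative assumptions, not rules from a real filter.
Test = Tuple[str, Callable[[Dict[str, str]], bool], float]

TESTS: List[Test] = [
    # Positive scores: the test is trusted to indicate spam.
    ("SUBJECT_ALL_CAPS", lambda m: m["subject"].isupper(), 1.5),
    ("BODY_MENTIONS_MLM", lambda m: "mlm" in m["body"].lower(), 2.0),
    # Negative score: the test is trusted to indicate ham.
    ("SENDER_IN_ADDRESS_BOOK", lambda m: m["from"] in KNOWN_SENDERS, -3.0),
]


def score_message(message: Dict[str, str]) -> Tuple[float, List[str]]:
    """Run every configured test; return the total score and the rules that fired."""
    total = 0.0
    triggered: List[str] = []
    for name, predicate, score in TESTS:
        if predicate(message):
            total += score
            triggered.append(name)
    return total, triggered


if __name__ == "__main__":
    sample = {
        "from": "stranger@example.net",
        "subject": "BUY NOW",
        "body": "Join our MLM today",
    }
    print(score_message(sample))
```

Keeping the tests in a plain list means the administrator can enable, disable, or re-weight any single test without touching the others, which is the point of the combined approach.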
Rather than making you try to imagine what's happening, we'll take a look at an example. Let's say that we have the following piece of E-mail, which arrives from a host claiming to be "mail.aol.com":
Received: from xx.xx.xx.xx (bogus [xx.xx.xx.xx])
    by mail.example.com (8.12.10/8.12.10/1.1) with SMTP id i1B5ktbr020202
    for <recipient@example.com>; Tue, 10 Feb 2004 21:47:02 -0800
Message-Id: <200402110547.i1B5ktbr020202@mail.example.com>
From: EHFVMTDRRXHKDU@aol.com
To: recipient@example.com
Subject: 100% Verified E-mail Addresses: 525 million (5cdS) ONLY $99.00
Date: Wed, 11 Feb 2004 18:47:25 -0500

MLM Marketing Opportunities!
535 million Email Addresses in a 5-disk set
REGULARLY $637.00 NOW ONLY $99.00
When this E-mail message arrives at our anti-spam filter, it faces both envelope tests and content tests. The following specific tests might be triggered in this case:
Rule | Why it was Triggered |
---|---|
SPF_FAIL | The host that connected to the mail server (xx.xx.xx.xx) was checked against the SPF records for its stated domain (aol.com), and was not found on the list of authorized mail servers for that domain. This is an envelope test, but rather than basing our diagnosis solely on this fact, we just assign a positive score (+1.604) to this suspicious occurrence. |
NO_REAL_NAME | The "From:" header usually also contains the sender's real name, as in "Real Name <address>". If the sender didn't enter a real name when he configured his mail client, only the E-mail address will be shown. This isn't technically invalid, but it's suspicious, so our feature recognizer has a rule to spot this and assigns a positive score (+0.160) to the total. |
DATE_IN_FUTURE_12_24 | Spammers often like to mess with the "Date:" header in the hope that your mail client will sort their mail closer to the top of the inbox. In this case, our feature recognizer detects that the date in the "Date:" header is 12-24 hours ahead of the date in the "Received:" header. Again, this is not conclusive by itself, since a drifting system clock could be to blame, but based on how often this rule is triggered in spam, and how rarely time differences this exaggerated appear in ham, we add a score of +3.332 to the total. |
MLM | A feature recognizer looking for a common spam keyword or phrase like "MLM" and variations on "Multi-Level Marketing" would find a match in the body of this E-mail, adding a score of +1.787 to the total. |
MILLION_EMAIL | Spam that offers CDs of "millions" of E-mail addresses is so common these days that there are hard-coded feature recognizer rules to identify patterns like "million(s) (of) (e-mail) addresses". Triggering this rule adds +1.999 to the total score for this message. |
DCC_CHECK | By comparing the contents of this E-mail with samples submitted by others at the Distributed Checksum Clearinghouse (DCC), we find that many, many others have already received this particular spam and classified it as such. That earns this message another +2.907 points. |
RCVD_IN_SBL | The connecting host's address (xx.xx.xx.xx) was looked up against the Spamhaus Block List (SBL), one of the more popular DNSBLs, and was listed there as a known spam source. This envelope test adds +1.113 to our total score. |
BAYES_60 | This particular E-mail message does not contain a lot of text, so it doesn't offer a lot of tokens for the Bayes learning engine to work with. It finds some suspicious tokens, but also a number of tokens that are found often in legitimate mail, so its overall confidence level for this mail is between 60% and 70%. This adds +1.592 to the total. |
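
Three of the header and body rules above are simple enough to sketch in a few lines of Python. The function names and the regular expression below are assumptions made for illustration, not the rule definitions of any real filter; only the standard library's `email.utils` and `re` modules are used.

```python
import re
from email.utils import parseaddr, parsedate_to_datetime


def no_real_name(from_header: str) -> bool:
    """True when From: carries an address but no display name."""
    display_name, address = parseaddr(from_header)
    return bool(address) and not display_name


def hours_ahead(date_header: str, received_timestamp: str) -> float:
    """Hours by which Date: is ahead of the Received: timestamp.

    Both strings must include a time zone offset so that the two
    datetimes can be compared directly.
    """
    sent = parsedate_to_datetime(date_header)
    received = parsedate_to_datetime(received_timestamp)
    return (sent - received).total_seconds() / 3600.0


def date_in_future_12_24(date_header: str, received_timestamp: str) -> bool:
    """True when Date: is at least 12 but less than 24 hours ahead."""
    return 12.0 <= hours_ahead(date_header, received_timestamp) < 24.0


# Hypothetical pattern for "million(s) (of) (e-mail) addresses".
MILLION_EMAIL = re.compile(
    r"\bmillions?\s+(?:of\s+)?(?:e-?mail\s+)?addresses\b", re.IGNORECASE
)

if __name__ == "__main__":
    print(no_real_name("EHFVMTDRRXHKDU@aol.com"))
    print(date_in_future_12_24("Wed, 11 Feb 2004 18:47:25 -0500",
                               "Tue, 10 Feb 2004 21:47:02 -0800"))
    print(bool(MILLION_EMAIL.search("535 million Email Addresses in a 5-disk set")))
```

Once both timestamps from the example message are converted to the same time zone, the "Date:" header turns out to be roughly 18 hours ahead of the "Received:" timestamp, which is why the 12-24 hour rule applies.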
The final score for this example message is 14.494, but the score alone still does not tell us whether this mail is spam or ham. That has to be determined on an individual basis, according to each recipient's threshold score, which can be set by individual users (we'll get into user quarantines in a moment). If one recipient sets his spam threshold at 5.0, and another recipient doesn't consider E-mail to be spam unless its score is 15.0 or higher, this item would clearly be spam to the first recipient and ham to the second.
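The arithmetic and the per-recipient decision can be checked with a short Python sketch. The rule scores and the two thresholds come from the example above; the recipient labels and the `classify` helper are hypothetical names introduced for illustration.

```python
# Scores of the eight rules triggered by the example message.
SCORES = {
    "SPF_FAIL": 1.604,
    "NO_REAL_NAME": 0.160,
    "DATE_IN_FUTURE_12_24": 3.332,
    "MLM": 1.787,
    "MILLION_EMAIL": 1.999,
    "DCC_CHECK": 2.907,
    "RCVD_IN_SBL": 1.113,
    "BAYES_60": 1.592,
}

# Round to three decimals to hide floating-point noise in the sum.
total = round(sum(SCORES.values()), 3)

# Hypothetical recipient labels; the thresholds are the two from the text.
THRESHOLDS = {"first_recipient": 5.0, "second_recipient": 15.0}


def classify(score: float, threshold: float) -> str:
    """A message is spam for a recipient when it reaches that recipient's threshold."""
    return "spam" if score >= threshold else "ham"


verdicts = {who: classify(total, limit) for who, limit in THRESHOLDS.items()}

if __name__ == "__main__":
    print(total)     # 14.494
    print(verdicts)  # {'first_recipient': 'spam', 'second_recipient': 'ham'}
```

The same message gets a single score but two different verdicts, because the decision is made per recipient rather than once for the whole server.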
Setting spam threshold scores can be a bit of an art. The various test scores are usually calibrated against a fixed value, such as 5.0, based on analyzing a huge sample of spam and ham, so it makes sense to start with that score. If too much spam is slipping through the filter, lower the threshold score a bit; if too much legitimate mail is ending up in the quarantine, raise the threshold score a bit.
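The trade-off behind that adjustment can be illustrated with a small Python sketch that counts both kinds of mistake at a given threshold. The `error_counts` helper and the sample scores are made up purely for illustration; they are not measurements from any real mail stream.

```python
from typing import Iterable, Tuple


def error_counts(scored_mail: Iterable[Tuple[float, bool]],
                 threshold: float) -> Tuple[int, int]:
    """Return (missed_spam, quarantined_ham) for the given threshold.

    scored_mail holds (score, is_spam) pairs, where is_spam is the
    true label of the message as judged by a person.
    """
    missed_spam = 0
    quarantined_ham = 0
    for score, is_spam in scored_mail:
        if is_spam and score < threshold:
            missed_spam += 1       # spam that slipped through to the inbox
        elif not is_spam and score >= threshold:
            quarantined_ham += 1   # legitimate mail that was quarantined
    return missed_spam, quarantined_ham


# Made-up (score, is_spam) pairs, for illustration only.
SAMPLE = [
    (14.5, True), (6.2, True), (4.1, True),
    (5.3, False), (1.0, False), (-2.0, False),
]

if __name__ == "__main__":
    for threshold in (4.0, 5.0, 6.0):
        print(threshold, error_counts(SAMPLE, threshold))
```

With this sample, lowering the threshold from 5.0 to 4.0 catches the spam that was slipping through, while raising it to 6.0 releases the quarantined legitimate message; neither move fixes both problems at once, which is why tuning is done gradually.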
Hopefully, even this brief example helps to illustrate the power of a combined envelope and content testing solution. The more tests the better! Next in this series, we will address how you can combine all of this data to help your users keep the spam out of their inbox, without breaking that all-important mail administrator rule: Lose No Mail.