Fighting Spam

I revisited my spam-filtering setup today after deciding, once again, that I really do not want to know "this one WEIRD trick" for doing X, Y or Z any more. (Especially Z.)

I made a few small and not-so-small changes and I am already seeing positive results (which is to say I am not seeing unwanted e-mail in my inbox at anywhere close to the previous rate).

The Status Quo

My email domain has been around for a while and I receive a LOT of spam. Sadly, that's just a fact of life these days. One of a number of spammy trends on the rise lately seems to be harvesting email addresses from the inboxes and address books of compromised email accounts. So even if I never gave out my email address except by actually emailing people, it would still fall into the hands of undesirables.

Even before my recent changes, I already had some measures in place to prevent spam from reaching my inbox.

DNSBL

I use a few real-time DNS blacklist services to reject mail from IP addresses with a poor reputation. There are a number of these and they all have different policies and sources of information. I chose several that I thought were conservative and stopped using a couple of them after getting negative feedback from some of my users. Even so, the remaining blacklists prevent a large number of messages from ever being accepted by my server.
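For readers who have not seen one up close, here is a minimal Python sketch of how a DNSBL lookup works in general: reverse the octets of the connecting IPv4 address, append the blacklist's zone, and do an ordinary DNS lookup. An answer means the address is listed; a "no such name" response means it is not. The zone name in the usage note is a placeholder, not one of the lists used here, and a real mail server does this check inside the MTA rather than in a script.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the DNSBL zone answers for this IP (i.e. it is listed)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        # No such name (or the lookup failed): treat the address as not listed.
        return False
```

For example, dnsbl_query_name("192.0.2.99", "dnsbl.example.org") produces the name 99.2.0.192.dnsbl.example.org.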

SpamAssassin

I also use SpamAssassin to scan mail that is not rejected by the DNSBL. SpamAssassin combines a large number of tests which can be run against an incoming message to assign it a "spamminess" score. It includes some adaptive features including an auto-whitelist and a Bayesian (machine learning) classifier. I regularly feed "spam" and "ham" messages to the Bayesian learner so it can correlate markers in future messages with what it has already seen in actual email delivered to my server. Messages with a really high spam score are automatically fed to the learner and then discarded. Messages with a medium-high score are delivered to my Spam folder, and messages with a low score are delivered to my inbox. Over 99% of the messages in my Spam folder are indeed spam, but for the remaining <1% it is nice to still have a copy of the falsely-identified-as-spam message (both so I can read it and so I can feed it as 'ham' to the learner). The latest version of SpamAssassin, 3.4, provides a number of nice improvements over the aging 3.3 series of releases.
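The three-way routing described above can be sketched in a few lines of Python. The two thresholds are assumptions for illustration only (5.0 happens to be SpamAssassin's default required score; the 10.0 cutoff is made up), and the function name is hypothetical. In practice this decision usually lives in the delivery agent's filter rules rather than in a script.

```python
def route_message(score: float, spam_at: float = 5.0, discard_at: float = 10.0) -> str:
    """Decide what to do with a message based on its spam score.

    spam_at and discard_at are illustrative thresholds, not the real ones.
    """
    if score >= discard_at:
        # Really high score: feed to the Bayesian learner as spam, then drop.
        return "learn-and-discard"
    if score >= spam_at:
        # Medium-high score: keep a copy in the Spam folder for review.
        return "spam-folder"
    # Low score: deliver normally.
    return "inbox"
```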

Beefing Things Up

While the above was preventing and/or catching a high percentage of spam, I was still seeing too many unwanted messages in my inbox.

More SpamAssassin

In looking at the unwanted messages which were still getting through, I noticed that many of them had several SpamAssassin tests in common. While the default test scores are generally very good, I determined that I could push many messages' scores "over the edge" by increasing the scores of several of these tests. It is tempting to make the scores a lot higher or to raise the scores for a large number of tests, but making small increases to only a few tests at a time is a better approach. Since false positives potentially mean losing real (wanted) email, I prefer to watch the results of conservative changes for a while before making any more. So far I'm happy.

It is also possible to write your own tests for SpamAssassin. So far I have only written one, but it covers a lot of messages that would otherwise have gotten through: I penalize messages whose Message-ID (which typically includes the sender's host or domain) ends in .me.
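The real test is a SpamAssassin rule, which is not reproduced here. As a rough stand-in, this Python sketch shows the same kind of check: pull the Message-ID header out of a raw message and see whether it ends in .me. The function name and the exact pattern are assumptions for illustration.

```python
import re
from email import message_from_string

# Matches a Message-ID whose domain part ends in ".me", with or without
# the closing angle bracket.
_ENDS_IN_ME = re.compile(r"\.me>?\s*$", re.IGNORECASE)


def message_id_ends_in_me(raw_message: str) -> bool:
    """Return True if the message's Message-ID header ends in .me."""
    msg = message_from_string(raw_message)
    message_id = msg.get("Message-ID", "")
    return bool(_ENDS_IN_ME.search(message_id))
```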

MX Records

MX records are a simple thing, but they can be surprisingly powerful in deterring spammers. In DNS, the MX record(s) for a domain indicate which server(s) may accept mail for that domain. MX records include a numeric priority as well, so if there is more than one server you may assign a different priority to each. MTAs that follow the rules (the ones most legitimate users use) will try the lowest-numbered MX first, and try higher-numbered ones in order only if the lower-numbered ones fail. Spammers' MTAs do not always follow the rules. A spammer is typically more interested in delivering many messages quickly than in ensuring delivery to any particular address. So such an MTA may try only the first MX. Some spam is delivered only to the last (highest-numbered) MX, in the hope that a non-primary server will have fewer safeguards. Or maybe an MX will simply be chosen at random.

MTAs that are non-compliant in any of these ways can easily be discouraged without penalizing legitimate MTAs by introducing one or more fake MX records. That may seem like a bad idea, but it makes sense. I was inspired by this page, which explains the idea in greater detail.

I ended up setting four MX records of differing priorities instead of the one I had before. The lowest and highest are fake, and the middle two are my primary and secondary mail servers, respectively. I had partially set up the secondary server before but wasn't using it. More on it below. Attempts to connect to the SMTP port on the lowest MX IP are immediately rejected. Connections to the SMTP port on the highest MX IP are held for 10 seconds and then rejected.

Now, instead of all MTAs (good and bad) trying my one and only MX server, there is some distinction. A good MTA will try the first (fake) MX, immediately fail and try the second, which is my primary mail server. If that server is temporarily unavailable (for whatever reason), the MTA will try the third MX, which is my secondary server. The mail can be queued there for later delivery. In the still-less-likely case that both my mail servers are unavailable, the MTA will try the fourth and final MX, fail after the 10-second hold, and then queue the message locally until it decides to try again.
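The behavior of a well-behaved MTA can be sketched as a loop over the MX records in priority order. This is a simplified model in Python, not real MTA code: the host names and priority numbers are invented for illustration, and the accepts callback stands in for an actual SMTP connection attempt.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class MX(NamedTuple):
    priority: int
    host: str


def deliver(mx_records: Iterable[MX], accepts: Callable[[str], bool]) -> Optional[str]:
    """Try each MX from lowest to highest priority number.

    Returns the host that accepted the message, or None if every MX
    refused (a real MTA would then queue the message and retry later).
    """
    for mx in sorted(mx_records, key=lambda record: record.priority):
        if accepts(mx.host):
            return mx.host
    return None


# Hypothetical layout: fake, primary, secondary, fake.
RECORDS = [
    MX(10, "fake-low.example.net"),
    MX(20, "primary.example.net"),
    MX(30, "secondary.example.net"),
    MX(40, "fake-high.example.net"),
]
```

With both real servers up, deliver(RECORDS, ...) lands on the primary. With the primary down it lands on the secondary, and with both down it returns None.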

A bad MTA may do one of several things. It may try the first MX and fail, then give up. Great! One less spam message delivered. It may try the last MX and fail, then give up. Great! One less spam message delivered. It may try one MX at random. Half of the time it will hit one of the fake MXes and fail, and 25% of the time it will hit the secondary server. There's only a 25% chance it will hit my primary server. (As explained below, spam is less likely to get through the secondary server if it is delivered there first.) If the bad MTA tries more than one MX, there's still an increased chance that the primary server will not be among them.
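The percentages above are easy to check with a little Python. This sketch assumes the MTA picks uniformly at random among the four MX records, which is a simplification; real spamware may behave differently. The second function only counts whether the primary is among the MXes picked, so it is an upper bound on how often the primary is actually reached.

```python
from fractions import Fraction
from itertools import combinations

# The four MX slots in priority order: fake, primary, secondary, fake.
MX_KINDS = ("fake", "primary", "secondary", "fake")


def p_single_pick(kind: str) -> Fraction:
    """Chance that one uniformly random pick lands on the given kind of MX."""
    return Fraction(MX_KINDS.count(kind), len(MX_KINDS))


def p_primary_among(tries: int) -> Fraction:
    """Chance the primary is among `tries` distinct MXes picked at random."""
    picks = list(combinations(range(len(MX_KINDS)), tries))
    hits = sum(1 for pick in picks if any(MX_KINDS[i] == "primary" for i in pick))
    return Fraction(hits, len(picks))
```

Under that assumption a single random pick hits a fake MX 1/2 of the time and the primary 1/4 of the time. An MTA that tries two different MXes still misses the primary entirely half of the time.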

From my observations so far, this has dramatically reduced the number of delivery attempts on the primary server with only a few landing on the secondary server. I didn't have to make any changes to the primary server itself.

Backup Mail Server

For relatively small setups like mine, having a backup mail server can be good or bad, and is often both. (Most larger setups inherently require multiple servers for fault tolerance, load balancing and/or scalability.)

The Good

On the plus side, it provides, well, backup mail service. If the primary server is down (whether by design or by accident), incoming mail has a safe place to hang out until it can be delivered to the primary server. As already noted, a well-behaved MTA (someone else's) will probably queue mail it couldn't send immediately but you never know if or for how long it will do so, or how long it will take it to attempt delivery a second (or third, etc) time. When the mail is queued on your own backup server you can control all those variables, and if you know about an outage on the primary server you can tell the backup to deliver all its queued mail as soon as you know the primary is back up.

The Bad

On the flip side, having a backup mail server means more administration, and may mean additional monetary expense and/or other costs. Additionally, it increases the "attack surface" for spammers and anyone else who might want to misuse your mail service. If it is not configured properly, a secondary server can act as a back door for spam to your inbox. A secondary server should be at least as strict as the primary server about what email it will accept and from whom.

I opted to implement a backup server but to put even more aggressive spam controls on it than I have on the primary server.

More DNSBL

As I already mentioned, I got some feedback that some of the DNS blacklists I originally liked were preventing some legitimate email from coming through on my primary server. Since the secondary server is not used for legitimate email in most normal circumstances I decided to use all the blacklists I like plus a couple of new ones. If I ever expect my primary server to be down for an extended period of time I can suspend using them, but otherwise there shouldn't be a problem.

Greylisting

Like the MX techniques above, greylisting takes advantage of the fact that many spammers' MTAs don't bother to follow all the rules of SMTP. Specifically, if the receiving mail server indicates a temporary failure (which can happen normally if there is high load on the server or certain types of maintenance), the sending server is supposed to wait a little while and then try again. Real MTAs do this. Spammer MTAs usually do not. A greylisting server keeps track of "tuples" consisting of the recipient, sender, and sender IP of an email. The first time a given tuple is encountered a temporary failure is given and the tuple is recorded along with a timestamp. If the same tuple is encountered again and enough time has elapsed since the original attempt (usually only a few minutes), the message is accepted and the tuple is added to a whitelist for a set amount of time (days or weeks, usually).
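The bookkeeping described above fits in a small class. This is a minimal in-memory Python sketch of the general technique, not how any particular greylisting tool is implemented. The class name, the 5-minute minimum delay and the 14-day whitelist lifetime are assumptions chosen for illustration.

```python
import time
from typing import Dict, Optional, Tuple

Triplet = Tuple[str, str, str]  # (sender IP, envelope sender, recipient)


class Greylist:
    def __init__(self, min_delay: float = 300.0, whitelist_ttl: float = 14 * 86400.0):
        self.min_delay = min_delay          # seconds a sender must wait before retrying
        self.whitelist_ttl = whitelist_ttl  # seconds a proven tuple stays whitelisted
        self._first_seen: Dict[Triplet, float] = {}
        self._whitelisted_until: Dict[Triplet, float] = {}

    def check(self, ip: str, sender: str, recipient: str,
              now: Optional[float] = None) -> str:
        """Return "accept" or "tempfail" (an SMTP 4xx reply) for this attempt."""
        now = time.time() if now is None else now
        key = (ip, sender, recipient)
        if self._whitelisted_until.get(key, 0.0) > now:
            return "accept"
        # Record the first attempt; later attempts reuse the original timestamp.
        first = self._first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            # The sender came back after waiting long enough: whitelist the tuple.
            self._whitelisted_until[key] = now + self.whitelist_ttl
            del self._first_seen[key]
            return "accept"
        return "tempfail"
```

A first attempt gets "tempfail", a retry after the minimum delay gets "accept", and further mail for the same tuple is accepted without delay until the whitelist entry expires.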

Greylisting is a very effective technique since so much spam comes from lazy MTAs that will never come back and try to send a given message a second time. Unfortunately, it impacts its users as well. All email from non-whitelisted senders is delayed. Even though the greylisting server may only require a few minutes before it will accept the message, the sending MTA might not try again for 10 or 15 minutes or even longer. Worse, there are some legitimate MTAs out there that are simply broken, so there is a risk of certain messages not arriving at all.

Once again, these concerns are of less importance on a secondary server so I decided to implement greylisting there but not on the primary server.

Because of the way my MX records are set up the secondary server doesn't see a lot of action, but the greylisting is working as intended and has prevented at least a few spam messages from reaching my mailbox.

Greylisting, Take 2

[UPDATE: This section added 8/27/14]

After watching my improved setup for a few days I decided I'd like at least some greylisting on my primary server (in addition to the wholesale greylisting performed on the secondary server). I still didn't want to subject good e-mail to unnecessary delays, so the question became how to tell the difference early enough in the mail delivery process for greylisting to have an effect.

The greylisting tool I use (milter-greylist) has some support for integrating with SpamAssassin's spamd program. The idea is for milter-greylist to accept the body (DATA) of an e-mail, send it to SpamAssassin for scoring, then black/grey/whitelist based on the result. Unfortunately I could not get this to work properly. It looked like milter-greylist would send some of the headers to spamd but then time out before any of the body was available. Probably some incompatibility between the versions of Sendmail, SpamAssassin and milter-greylist but I did not take much time to debug it.

Instead of trying to front-load everything into the initial SMTP connection with would-be spammers, I decided to adopt a "fool me once, shame on you; fool me twice, shame on me" strategy. I like to think of it as adaptive greylisting after the fact. I created a small script to read various pieces of information from every email message delivered to my server and store the information in a database. The information includes the SpamAssassin score, the IP address and associated network block of the sender and other details. After collecting this information and playing with it for a while, I wrote another script. This script creates a blacklist of sender IP addresses for messages with a spam score above 9.0, and a greylist of sender IP blocks for messages with a spam score above 4.0. This script runs several times an hour and inserts the updated black/grey lists into the milter-greylist configuration.

The default action on my primary server is still not to greylist. Now, however, if a sender who previously sent 'spammy' email to the system tries to send another one, it will be black- or greylisted (depending on how spammy the prior email was). This approach lets the system adapt to what it has actually received, in addition to relying on external blacklists and other information. This system is easy to tune as I can adjust the score thresholds for black- or greylisting, and I can change other criteria based on what is in the database. For example, I may decide to only build the lists from email received in the last 14 days.
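The list-building step can be sketched in Python as follows. This is not the actual script: the function name and the input format (pairs of sender IP and spam score, as they might come out of the database) are assumptions, and a /24 network stands in for the "associated network block", which the real database stores separately. The 9.0 and 4.0 defaults are the thresholds mentioned above.

```python
import ipaddress
from typing import Iterable, Set, Tuple


def build_lists(records: Iterable[Tuple[str, float]],
                black_at: float = 9.0,
                grey_at: float = 4.0,
                block_prefix: int = 24) -> Tuple[Set[str], Set[str]]:
    """Turn (sender IP, spam score) records into a blacklist and a greylist.

    The blacklist holds individual IPs; the greylist holds network blocks.
    The /24 block size is an illustrative stand-in for the real block data.
    """
    blacklist: Set[str] = set()
    greylist: Set[str] = set()
    for ip, score in records:
        if score > black_at:
            blacklist.add(ip)
        if score > grey_at:
            block = ipaddress.ip_network(f"{ip}/{block_prefix}", strict=False)
            greylist.add(str(block))
    return blacklist, greylist
```

Writing the two sets out in the greylisting tool's configuration syntax is left out on purpose, since that depends on the tool and version in use.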

As a side note, I also set up MX syncing between my primary and secondary servers' milter-greylist installations. Among other benefits, this prevents a sender who was greylisted on one server from immediately sending its email through the other.

I will be watching my logs and the database for a while to determine how effective this change is. So far it appears to be another incremental improvement--several spammy-looking messages have been turned away by milter-greylist and so far only one has returned to actually deliver the message.

Delays

This final technique again takes advantage of the laziness of certain spammer MTAs, and it is very simple. When a client connects to the mail server, wait a few seconds before printing SMTP's initial 220 "ready" message. A nice MTA will wait patiently. A lazy MTA will hang up and leave, and an impatient MTA will start talking before it gets the 220 message and subsequently be ignored. I have a 2-second delay on my primary server and a 6-second delay on my secondary server.
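Here is a minimal Python sketch of the idea using asyncio, assuming a handler that is given the connection's reader and writer. It illustrates the technique only; on a real server the delay is a setting in the MTA itself, and the banner host name below is a placeholder.

```python
import asyncio

GREETING = b"220 mail.example.net ESMTP ready\r\n"  # placeholder host name


async def delayed_greeting(reader, writer, delay: float = 2.0) -> str:
    """Hold the 220 banner for `delay` seconds and see what the client does.

    Returns "greeted" for a patient client, "hung-up" if the client
    disconnected during the wait, or "talked-early" if it sent data
    before the banner.
    """
    try:
        # Anything readable during the delay means the client did not wait.
        early = await asyncio.wait_for(reader.read(1), timeout=delay)
    except asyncio.TimeoutError:
        writer.write(GREETING)
        return "greeted"
    writer.close()
    return "hung-up" if early == b"" else "talked-early"
```

A wrapper passed to asyncio.start_server could call this first and only continue with the SMTP conversation when the result is "greeted".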

Additional / Alternative Methods

Amavisd

Amavisd is a unified interface between the local MTA and content checkers such as SpamAssassin and virus scanners. It can bounce, quarantine, redirect or deliver a message based on the results of the content checks. I have used it for certain projects and I know people who swear by it, but my personal preference is to manage the pieces individually. With SpamAssassin in particular I feel I have finer control and better performance by running spamd directly. Plus I've never really liked the configuration syntax for amavisd, but I'm not much of a Perl guy either (the configuration needs to be written in Perl). Amavisd does make it easy to reject spam messages outright, but I don't find accepting and then deleting them to be a burden.

I will add other ideas here as I hear or think of them.

Results

All these changes are still fairly new, but as I've alluded to I'm getting less spam. A lot less, in both my inbox and my spam folder.

If I were more interested in the science and less interested in my inbox I would have made these changes one at a time and measured the results for a day or more in between each one. But I didn't. Sue me. :) I'm happy though, and I think my users will be as well.

Additional Results

[UPDATE: This section added 9/15/14]

I'm thrilled with the results. Before I started this project I would get dozens of spam messages in my inbox every day. Now I get one every couple days or so. My other users are seeing similar results. Hooray!

I have made a few modifications since my last update, and they helped cover the final distance:

  1. My script to generate an adaptive greylisting config file now runs every four minutes.
  2. I dropped the greylist "average spam score" threshold (in the above script) from 4.0 to 1.9. Legitimate messages with scores higher than that can generally be delayed without upsetting users. They tend to be ads or newsletters rather than messages from real people or one-off emails from web services. And once the first of a given newsletter is delivered (because it was re-sent after being greylisted), future messages will be whitelisted anyway.
  3. In addition to the black- and greylists from the database, the script also includes any addresses already tagged for greylisting by either MX. This has a double benefit: Legitimate mail hosts that retry delivery of greylisted messages can be whitelisted so future messages from them will not be delayed. Ill-behaved hosts that don't retry delivery will always be greylisted, so future messages from them will not be accepted (since the hosts don't ever retry delivery).
  4. All messages that arrive between midnight and 7:59am are greylisted. The only reason greylisting all the time is a problem is that users expect their mail to come in more or less immediately. But if the users aren't awake (or using email), they don't care. Additionally, spam traffic seems to be higher in the overnight hours. This allows the greylist program to learn about more addresses and block more spam without negatively impacting users. Combined with the previous change, this increases the reach and effectiveness of greylisting on the server, even during the day.
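Items 2 and 4 in the list above boil down to a simple decision, sketched here in Python. The function and parameter names are made up for illustration. The real logic lives in the generated greylisting configuration, not in a function like this.

```python
import datetime


def should_greylist(arrival: datetime.time,
                    avg_spam_score: float,
                    grey_at: float = 1.9) -> bool:
    """Greylist overnight mail, or mail from senders with a spammy history.

    arrival is the local time the message arrives; avg_spam_score is the
    sender's average SpamAssassin score from earlier messages.
    """
    overnight = arrival < datetime.time(8, 0)  # midnight through 7:59am
    return overnight or avg_spam_score > grey_at
```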

I'm very pleased with the result of all these changes. If additional changes become necessary over time I may update this post again but for now I'm calling the project a success.

john | Thursday 21 August 2014 - 4:57 pm | Linux, IT, FreeBSD