[lug] Bayesian Spam Filters

Chris Riddoch socket at peakpeak.com
Mon Nov 4 11:51:59 MST 2002


Bernard Johnston <berjoh at attbi.com> writes:

> There was an article on today's Slashdot about a Bayesian-based spam
> filter written as a simple Perl script. Looks quite effective and
> easy to implement. Has anyone tried this?

I've not tried those specific implementations, but I've been using
ifile for sorting and filtering my email for the past three weeks.

In order to use Bayesian techniques, you need a substantial amount of
training data already set up.  If you've used procmail filters (or
some equivalent) in the past to do this, and corrected the mistakes of
whatever spam filtering you've used so that everything is already
categorized, you should be fine.
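
For the curious, the training step really is just counting words per
folder.  Here's a rough Python sketch of the idea; the directory
layout and the tokenizer are my own assumptions, not how ifile
actually stores or parses anything:

    import os, re
    from collections import defaultdict

    def tokenize(text):
        # crude tokenizer: lowercase runs of letters, digits, apostrophes
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(maildir):
        # expects maildir/<category>/<one file per message>
        # counts: category -> word -> count; totals: category -> messages
        counts = defaultdict(lambda: defaultdict(int))
        totals = defaultdict(int)
        for category in os.listdir(maildir):
            catdir = os.path.join(maildir, category)
            if not os.path.isdir(catdir):
                continue
            for name in os.listdir(catdir):
                path = os.path.join(catdir, name)
                with open(path, errors="replace") as f:
                    for word in tokenize(f.read()):
                        counts[category][word] += 1
                totals[category] += 1
        return counts, totals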

My archives consist of about 60,000 messages, many of which are from
mailing lists, but not all.  Across them, there are about 25 million
words.  Instead of doing just two-category spam/non-spam filtering, I
opted to completely switch to using ifile to replace *all* my
rule-based filtering, including spam.

I've got roughly 80 folders, one of which is for spam.  Considering
the number of places an email could go, and the likelihood of
categorizing a message wrong simply by chance, naive Bayesian
filtering has done surprisingly well on my mail.  About 5% of my
messages are misfiled, and there seems to be only one predictable
source of miscategorized messages: Yahoo-based lists and email.
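
The classification step itself is just picking the folder with the
highest score.  Continuing the sketch above (and reusing its
tokenize(), counts, and totals), it looks roughly like this, with
add-one smoothing thrown in, which may or may not be what ifile does
internally:

    import math

    def classify(text, counts, totals):
        # naive Bayes: pick the category maximizing
        # log P(category) + sum of log P(word | category)
        total_msgs = sum(totals.values())
        vocab_size = len(set(w for ws in counts.values() for w in ws))
        best, best_score = None, float("-inf")
        for category, words in counts.items():
            cat_total = sum(words.values())
            # log prior: fraction of training messages in this category
            score = math.log(totals[category] / total_msgs)
            for word in tokenize(text):
                # add-one smoothing so unseen words don't zero it out
                p = (words.get(word, 0) + 1) / (cat_total + vocab_size)
                score += math.log(p)
            if score > best_score:
                best, best_score = category, score
        return best

With 80 folders it's that argmax over all of them, which is why a
shared advertisement footer can tip a message into the wrong group.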

Yahoo appends a substantial amount of junk to the bottom of messages
going through their system.  It's not uncommon for Yahoo's
advertisements for themselves and others to be a larger proportion of
the body of the email than the message itself.  The largest source of
misfiling is ifile sending a message to the wrong Yahoo group's folder
because, by chance, the same advertisement has occurred more times in
that group than in the right one, and the content of the message isn't
distinctive enough of the right group's topic to outweigh it.

One of my side projects now is to implement a bigram-based
modification to ifile, so that the frequency with which word pairs
occur together also counts as evidence, and ifile will become
*slightly* less naive.  It will increase the amount of storage space
necessary by an unfortunate amount, but the miscategorization rate
should drop notably.  My ~/.idata file is presently 2506582 bytes.
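
The feature-extraction side of that is simple enough to sketch; in
terms of the earlier Python sketches, it's just a tokenizer that emits
adjacent word pairs alongside the single words:

    import re

    def bigram_tokenize(text):
        # single words plus adjacent pairs, so "cheap diet pills"
        # yields cheap, diet, pills, cheap_diet, diet_pills
        words = re.findall(r"[a-z0-9']+", text.lower())
        return words + [a + "_" + b for a, b in zip(words, words[1:])]

Swapping that in for tokenize() in train() and classify() roughly
doubles the number of tokens per message and grows the vocabulary much
faster than that, which is where the storage hit comes from.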

As far as I'm concerned, I'd call its current performance a success,
but that's mostly because I've got a nice archive of email.  The
success of anything implementing naive Bayesian algorithms will depend
on how many emails you have archived, and how much and how well you've
categorized them.

-- 
Chris Riddoch       | epistemological
socket at peakpeak.com | humility


