Question

I'm new to machine learning, and for my first project I'd like to write a naive Bayes spam filter. I was wondering if there are any publicly available training sets of labeled spam/not spam emails, preferably in plain text and not a dump of a relational database (unless they pretty-print those?).

I know such a publicly available database exists for other kinds of text classification, specifically news article text. I just haven't been able to find the same sort of thing for emails.

Solution

Here is what I was looking for: http://untroubled.org/spam/

This archive holds around a gigabyte of compressed spam messages accumulated from 1998 to 2011. Now I just need non-spam email, so I'll pull that from my own Gmail account using the getmail program and the tutorial at mattcutts.com.
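Once both kinds of message are available as plain text, the classifier itself is small. What follows is only a minimal sketch of a multinomial naive Bayes filter with add-one (Laplace) smoothing, not a tuned implementation. The function names (tokenize, train, classify), the "spam"/"ham" label strings, and the four toy messages are all made up for illustration; real training data would come from the archives discussed here.

```python
import math
import re
from collections import Counter

LABELS = ("spam", "ham")  # hypothetical label names used throughout this sketch


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """Count words per class from an iterable of (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-posterior score.

    Assumes the training data contained at least one message per label.
    """
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data, invented purely to exercise the code.
toy_messages = [
    ("Win a free prize now", "spam"),
    ("Cheap pills, free shipping", "spam"),
    ("Meeting moved to Monday", "ham"),
    ("Lunch on Monday?", "ham"),
]
model = train(toy_messages)
print(classify("free prize", model))
print(classify("monday meeting", model))
```

Working in log space avoids numeric underflow when long messages multiply many small probabilities together.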

OTHER TIPS

Sure, there's Spambase, which, as far as I'm aware, is the most widely cited spam data set in the machine learning literature.

I have used this data set many times; each time I am impressed by how much effort has been put into its formatting and documentation.

A few characteristics of the Spambase set:

  • 4601 data points--all complete

  • each comprised of 57 features (attributes) plus a class label, for 58 columns in total

  • each data point is labeled 'spam' or 'not spam'

  • approx. 40% are labeled spam

  • of the features, all are continuous (vs. discrete)

  • a representative feature: average length of uninterrupted sequences of capital letters


Spambase is archived in the UCI Machine Learning Repository; it is also available on the website for the excellent ML/statistical computation treatise, Elements of Statistical Learning by Hastie et al.
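For a sense of the layout, here is a rough sketch of reading a file in the Spambase format: comma-separated rows of 57 numeric attributes followed by a 0/1 class label (1 meaning spam). The helper name load_spambase is hypothetical, and the two rows below are made-up values used only to exercise the parser, not real Spambase records.

```python
import csv
import io

N_FEATURES = 57  # numeric attributes per row; the 58th column is the class label


def load_spambase(fileobj):
    """Parse rows in the Spambase CSV layout into (features, labels)."""
    features, labels = [], []
    for row in csv.reader(fileobj):
        if not row:
            continue  # skip blank lines
        if len(row) != N_FEATURES + 1:
            raise ValueError(
                f"expected {N_FEATURES + 1} columns, got {len(row)}"
            )
        features.append([float(value) for value in row[:N_FEATURES]])
        labels.append(int(row[-1]))  # 1 = spam, 0 = not spam
    return features, labels


# Two invented rows in the same layout (NOT real data).
fake_file = io.StringIO(
    ",".join(["0.5"] * N_FEATURES + ["1"]) + "\n"
    + ",".join(["0"] * N_FEATURES + ["0"]) + "\n"
)
X, y = load_spambase(fake_file)
spam_fraction = sum(y) / len(y)
print(len(X), len(X[0]), spam_fraction)
```

With a locally downloaded copy of the data file (assumed here to be saved as spambase.data), the same function could be called as `with open("spambase.data", newline="") as f: X, y = load_spambase(f)`. Because the features are continuous, a Gaussian naive Bayes variant is a more natural fit for this set than the word-count variant.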

SpamAssassin has a public corpus of both spam and non-spam messages, although it hasn't been updated in a few years. Read the readme.html file to learn what's there.

You might consider taking a look at the TREC spam/ham corpus (which I think is built on the collection of Enron emails that was made public during the court case). TREC generally runs a range of competitive text processing tasks, so it might give you some references for comparison.

The downside is that they're stored in raw mbox format, though there are parsers available in many languages (Apache Tika is a good example).
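If you are working in Python, the standard library's mailbox module can read mbox files directly, so a separate parser may not be needed. The sketch below is only one way to pull the subject and plain-text body out of each message; the helper name iter_mbox_texts and the one-message sample file are invented for illustration, and real corpora will contain messier encodings than this handles gracefully.

```python
import mailbox
import os
import tempfile


def iter_mbox_texts(path):
    """Yield (subject, plain_text_body) for each message in an mbox file."""
    box = mailbox.mbox(path)
    try:
        for message in box:
            chunks = []
            # walk() visits the message itself and any nested MIME parts.
            for part in message.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True)
                if payload is None:
                    continue
                charset = part.get_content_charset() or "utf-8"
                try:
                    chunks.append(payload.decode(charset, errors="replace"))
                except LookupError:
                    # Unknown charset name: fall back to a lossless byte mapping.
                    chunks.append(payload.decode("latin-1"))
            yield message.get("Subject", ""), "\n".join(chunks)
    finally:
        box.close()


# A made-up, single-message mbox written to a temporary file for the demo.
sample = (
    "From alice@example.com Thu Jan  1 00:00:00 2004\n"
    "From: alice@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Lunch on Monday?\n"
    "\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.mbox")
    with open(sample_path, "w", newline="\n") as handle:
        handle.write(sample)
    messages = list(iter_mbox_texts(sample_path))

print(messages[0][0])
```

The extracted body strings can then be fed to whatever tokenizer the classifier uses.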

This webpage isn't TREC's own, but it seems to be a good overview of the task with links to the data: http://plg.uwaterloo.ca/~gvcormac/spam/

A more modern spam training set can be found on Kaggle. Moreover, you can test the accuracy of your classifier on their website by uploading your results.

I also have an answer: here you can find a daily refreshed Bayesian database for initial training, as well as a daily generated archive of captured spam. You will find instructions on how to use it on the site.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow