Which spam corpus can I use in NLTK?

https://stackoverflow.com/questions/9876616

26-05-2021

Question

My question is closely related to this one, but I decided to open a separate thread. I hope that is fine.

I am also building a spam filter using NLTK in Python, but I've just started.

I am wondering which spam corpus I can use and how to import it. I have not found any spam corpora built into NLTK (here).

Thank you in advance.

Solution

This presentation uses the enron-spam dataset (200,000+ emails).

The training and testing sets come from a dataset of 200,000+ Enron emails which contain both “spam” and “ham” emails.
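Since the question also asks how to import such a corpus, here is a minimal sketch of reading a downloaded dataset into labeled (text, label) pairs. It assumes the archive has been unpacked into a directory with one "ham" and one "spam" subfolder of .txt files; that layout, the function name load_labeled_emails, and the example path in the comment are assumptions for illustration, not something stated in the answer.

```python
from pathlib import Path


def load_labeled_emails(root, labels=("ham", "spam")):
    """Return a list of (text, label) pairs read from root/<label>/*.txt.

    Assumed layout (hypothetical): root/ham/*.txt and root/spam/*.txt.
    """
    documents = []
    for label in labels:
        # sorted() keeps the order stable between runs
        for path in sorted(Path(root, label).glob("*.txt")):
            # Old mail corpora mix encodings; latin-1 can decode any byte.
            text = path.read_text(encoding="latin-1")
            documents.append((text, label))
    return documents


# Example call, with a hypothetical path:
# documents = load_labeled_emails("data/enron1")
```

If you would rather stay inside NLTK, its CategorizedPlaintextCorpusReader class can read a directory tree like this one, taking a root directory, a file pattern, and a cat_pattern that derives the category from the path. Check the NLTK corpus reader documentation for the exact arguments.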

OTHER TIPS

Spam is not hard to obtain. Getting reasonably fresh spam in large quantities is not necessarily a big challenge, either; the real conundrum is how to obtain ham. If you are only building your own spam filter, of course, you can use your own ham.

The SpamAssassin Public Corpus is getting very old, but it is available: http://spamassassin.apache.org/publiccorpus/

There are also the corpora from the TREC spam track, which are somewhat larger, but not much newer or less biased: http://plg.uwaterloo.ca/~gvcormac/treccorpus/

Various enthusiasts continue to publish their spam on the web, but most fail to include full headers etc. If you are only interested in "bag of words" filtering, maybe that's good enough.
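To make "bag of words" filtering concrete, here is a minimal sketch that turns a message body into a dictionary of word-presence features. The function name bag_of_words and the simple regular-expression tokenizer are assumptions chosen for illustration; a real filter would likely want better tokenization.

```python
import re


def bag_of_words(text):
    """Map each distinct lowercase token in text to True.

    Word order, word counts, and message headers are all ignored,
    which is why header-less spam dumps are enough for this approach.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {token: True for token in tokens}


features = bag_of_words("Cheap meds! Buy cheap meds now")
```

A list of (bag_of_words(text), label) pairs has the shape that nltk.NaiveBayesClassifier.train expects as its labeled featuresets, so this dictionary form plugs directly into NLTK's classifiers.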

Licensed under: CC-BY-SA with attribution
