Question

I am trying to implement an anti-spam engine using a probabilistic approach. The very first step is to analyse and do some research on the types of words and their frequency in spam. So I wrote a very simple program in Java to filter out the words from spam. I break the entire text file into lines, and the lines into words, using split("\W") (\W matches any non-word character).
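Roughly, the counting logic looks like this (shown here as a Python sketch rather than my actual Java code; spam_sample.txt is just a placeholder file name):

import re
from collections import Counter

def word_frequencies(text):
    # Split on runs of non-word characters (the Python equivalent of the
    # split("\W") call above) and drop the empty strings the split leaves behind
    words = [w.lower() for w in re.split(r"\W+", text) if w]
    return Counter(words)

with open("spam_sample.txt") as f:  # placeholder spam sample
    print(word_frequencies(f.read()).most_common(10))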

I downloaded some spam archives and thought I would easily analyse or scan these txt files with this application. But I soon ran into a major problem: the text files contain HTML tags, links, email headers, and so on.

Now I am wondering how to tackle this. Should I use an HTML parser, or strengthen my logic for analysing these files?

The answer mainly depends on whether I will face the same problem in the implementation phase. What do current spam filters do?

Solution

The mail envelope is a standard, though usually invisible, part of every email. Without these headers, the message would not reach you. There is no need to write parsing logic yourself when Python's standard email library does the job.

from email import message_from_string

# Read the raw message from disk and parse it into an email.message.Message object
with open("mfile_path_to_message") as mailfd:
    message = message_from_string(mailfd.read())

print(message.get("from"))
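Once the message is parsed, a minimal sketch like the following (assuming the message object from above) pulls out just the body text, so the headers never reach your word analysis:

# Collect only the body content of the parsed message, skipping the headers
body_parts = []
for part in message.walk():
    if part.get_content_type() in ("text/plain", "text/html"):
        payload = part.get_payload(decode=True)
        if payload:
            body_parts.append(payload.decode(errors="replace"))
body_text = "\n".join(body_parts)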

If your messages are in Unix mbox format, the standard mailbox library will be helpful. For parsing rich text such as HTML, BeautifulSoup is among the better options.
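As a rough sketch of combining the two (spam.mbox is a placeholder for your archive), you could iterate over an mbox and strip the HTML down to plain text before counting words:

import mailbox
from bs4 import BeautifulSoup

# Walk every message in a (hypothetical) mbox archive of spam
for message in mailbox.mbox("spam.mbox"):
    for part in message.walk():
        if part.get_content_type() == "text/html":
            html = part.get_payload(decode=True)
            if html:
                # Strip the tags and keep only the visible text
                text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
                print(text)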

Licensed under: CC-BY-SA with attribution