Question

I once heard that filtering spam with blacklists is not a good approach, since a user searching for entries in your dataset may be looking for information from the blocked sources. It would also become a burden to continuously validate the current state of each blocked spammer, checking whether the site/domain still disseminates spam.

Considering that any approach must be efficient and scalable enough to support filtering on very large datasets, what strategies are available for getting rid of spam in an unbiased manner?

Edit: if possible, an example of a strategy, even just the intuition behind it, would be very welcome along with the answer.


Solution

Spam filtering, especially in email, has been revolutionized by neural networks. Here are a couple of papers that provide good reading on the subject:

"On Neural Networks and the Future of Spam," A. C. Cosoi, M. S. Vlad, V. Sgarciu. http://ceai.srait.ro/index.php/ceai/article/viewFile/18/8

"Intelligent Word-Based Spam Filter Detection Using Multi-Neural Networks," Ann Nosseir, Khaled Nagati, and Islam Taj-Eddin. http://www.ijcsi.org/papers/IJCSI-10-2-1-17-21.pdf

"Spam Detection using Adaptive Neural Networks: Adaptive Resonance Theory," David Ndumiyana, Richard Gotora, and Tarisai Mupamombe. http://onlineresearchjournals.org/JPESR/pdf/2013/apr/Ndumiyana%20et%20al.pdf

EDIT: The basic intuition behind using a neural network for spam filtering is to assign a weight to terms based on how often they are associated with spam.

Neural networks can be trained most quickly in a supervised environment, where you explicitly provide the classification of each sentence in the training set. Without going into the nitty-gritty, the basic idea can be illustrated with these sentences:

Text = "How is the loss of the Viagra patent going to affect Pfizer", Spam = false Text = "Cheap Viagra Buy Now", Spam = true Text = "Online pharmacy Viagra Cialis Lipitor", Spam = true

For a two-stage neural network, the first stage calculates the likelihood of spam based on whether a word appears in the sentence. From our example:

viagra => 66%
buy => 100%
Pfizer => 0%
etc.
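To make that first stage concrete, here is a minimal sketch (Python, chosen just for illustration) that derives those per-word fractions from the three example sentences by counting how often each word appears in spam versus in all sentences:

```python
from collections import Counter

# The three example sentences from above, with their spam labels.
corpus = [
    ("How is the loss of the Viagra patent going to affect Pfizer", False),
    ("Cheap Viagra Buy Now", True),
    ("Online pharmacy Viagra Cialis Lipitor", True),
]

seen = Counter()  # sentences containing the word
spam = Counter()  # ...of which were labeled spam

for text, is_spam in corpus:
    for word in set(text.lower().split()):
        seen[word] += 1
        if is_spam:
            spam[word] += 1

for word in ("viagra", "buy", "pfizer"):
    print(f"{word} => {spam[word] / seen[word]:.1%}")
# viagra => 66.7%   (appears in 3 sentences, 2 of them spam)
# buy => 100.0%
# pfizer => 0.0%
```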

Then the results from the first stage are used as variables in the second stage:

viagra & buy => 100% Pfizer & viagra=> 0%

This basic idea is run for many of the combinations of all the words in your training data. The end result, once trained, is basically just an equation that, based on the context of the words in a sentence, can assign a probability of its being spam. Set a spamminess threshold and filter out any data scoring above it.
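As a rough illustration of what that trained end result looks like, here is a sketch using scikit-learn: binary word-presence features play the role of the first stage, a small hidden layer weights combinations of words like the second stage, and a threshold does the filtering. The tiny corpus and the 0.9 threshold are purely illustrative, not tuned values:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

texts = [
    "How is the loss of the Viagra patent going to affect Pfizer",
    "Cheap Viagra Buy Now",
    "Online pharmacy Viagra Cialis Lipitor",
]
labels = [0, 1, 1]  # 0 = ham, 1 = spam

vec = CountVectorizer(binary=True)           # stage 1: word-presence features
X = vec.fit_transform(texts)
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, labels)                           # stage 2: weight word combinations

THRESHOLD = 0.9  # illustrative spamminess threshold

def is_spam(text: str) -> bool:
    p = clf.predict_proba(vec.transform([text]))[0, 1]
    return p > THRESHOLD

print(is_spam("Buy cheap Viagra now"))        # likely True
print(is_spam("Pfizer earnings call notes"))  # likely False
```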

OTHER TIPS

Blacklists do have value, for a number of reasons:

  1. They're easy to set up and scale - it's just a key/value store (see the sketch after this list), and you can probably just re-use some of your caching logic for the most basic implementation.
  2. Depending on the size and type of the spam attack, there will probably be some very specific terms or URLs being used. It's much faster to throw that term into a blacklist than wait for your model to adapt.
  3. You can remove items just as quickly as you added them.
  4. Everybody understands how they work and any admin can use them.
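As a sketch of point 1, a blacklist really can be as simple as a key/value store. Here it's an in-memory set; in production it could just as well be Redis or the cache your application already runs. The domain below is made up for illustration:

```python
class Blacklist:
    def __init__(self):
        self._blocked = set()

    def add(self, term: str) -> None:
        self._blocked.add(term.lower())

    def remove(self, term: str) -> None:
        self._blocked.discard(term.lower())

    def is_blocked(self, text: str) -> bool:
        # Block if any blacklisted term appears in the text.
        return any(term in text.lower() for term in self._blocked)

bl = Blacklist()
bl.add("cheap-pills.example")                              # hypothetical spam domain
print(bl.is_blocked("Visit cheap-pills.example today!"))   # True
bl.remove("cheap-pills.example")                           # removal is just as fast
print(bl.is_blocked("Visit cheap-pills.example today!"))   # False
```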

The key to fighting spam is monitoring. Make sure you have some sort of interface showing which items are on your blacklist, how often they've been hit in the last 10 minutes / hour / day / month, and the ability to easily add and remove items.
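A minimal sketch of that kind of hit monitoring, assuming the Blacklist above: record a timestamp per hit, then count how many fall inside a given window (10 minutes, an hour, and so on):

```python
import time
from collections import defaultdict

class HitMonitor:
    def __init__(self):
        self._hits = defaultdict(list)  # term -> list of hit timestamps

    def record(self, term: str) -> None:
        self._hits[term].append(time.time())

    def hits_within(self, term: str, seconds: float) -> int:
        cutoff = time.time() - seconds
        return sum(1 for t in self._hits[term] if t >= cutoff)

mon = HitMonitor()
mon.record("cheap-pills.example")                     # hypothetical term from above
print(mon.hits_within("cheap-pills.example", 600))    # hits in the last 10 minutes
print(mon.hits_within("cheap-pills.example", 3600))   # hits in the last hour
```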

You'll want to combine a number of different spam detection models and tactics. Neural nets seem to be a good suggestion, and I'd recommend looking at user behavior patterns in addition to just content. Normal humans don't do things like send batches of 1,000 emails every 30 seconds for 12 consecutive hours.
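For example, a crude behavioral check might flag any sender whose message rate is far beyond human. The limits below (50 messages per 30 seconds) are illustrative, not recommendations:

```python
import time
from collections import defaultdict, deque

MAX_MESSAGES = 50       # illustrative rate limit
WINDOW_SECONDS = 30.0

recent = defaultdict(deque)  # sender -> timestamps of recent messages

def looks_like_bot(sender, now=None):
    now = now if now is not None else time.time()
    q = recent[sender]
    q.append(now)
    # Drop timestamps that have aged out of the window.
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_MESSAGES

# A burst of 1,000 sends in ~30 seconds trips the check almost immediately.
flagged = any(looks_like_bot("mailer@example.com", now=i * 0.03)
              for i in range(1000))
print(flagged)  # True
```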

Licensed under: CC-BY-SA with attribution