Question

I've got a Rails app where users can send messages to other users. The problem is, it's the type of site that draws many spammers who send bogus messages.

I'm already aware of a couple of spam-detection services like Akismet (via rakismet) and Defensio (via defender). The problem with these is that it looks like they don't take into account messages the user has already sent. The type of spam I'm seeing on my site is where a user sends the same (or very similar) message to many other users. As such, I'd like to be able to compare each new message against at least a handful of the sender's past messages to make sure it's different enough not to be considered spam.

So far, the best thing I've come across is the Text::Levenshtein distance implementation, which calculates the number of single-character edits between two strings. I suppose I could divide the number of differences by the string length, and if the result is above a certain threshold, the message isn't considered spam.
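Here is a minimal sketch of that normalized-distance idea, using Text::Levenshtein from the text gem. The recent_messages argument, the helper name, and the 0.5 cutoff are all placeholders you'd tune against your own data:

```ruby
# Gemfile: gem "text"
require "text"

# Returns true if the new message is suspiciously close to any recent one.
# The 0.5 cutoff is arbitrary and should be tuned against real messages.
def looks_like_duplicate_spam?(new_body, recent_messages, cutoff = 0.5)
  recent_messages.any? do |old_body|
    distance = Text::Levenshtein.distance(new_body, old_body)
    max_len  = [new_body.length, old_body.length, 1].max
    # Normalized distance: 0.0 means identical, 1.0 means completely different.
    (distance.to_f / max_len) < cutoff
  end
end

# e.g. looks_like_duplicate_spam?(params[:body], current_user.sent_messages.last(5).map(&:body))
```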

One other thing I've come across is Classifier::Bayes, which makes a best guess as to what category something falls into. Still pondering on this one.
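For reference, this is roughly what the classifier gem's API looks like; the training strings below are made up, and in practice you'd train it on real messages your users have flagged:

```ruby
# Gemfile: gem "classifier" (or the maintained classifier-reborn fork)
require "classifier"

bayes = Classifier::Bayes.new("Spam", "Ham")

# Toy training data -- you'd feed it real flagged and legitimate messages.
bayes.train("Spam", "Buy cheap watches now, click this link for a great deal")
bayes.train("Ham",  "Hey, thanks for your reply yesterday, are we still meeting for lunch?")

bayes.classify("cheap watches, great deal, click here")  # => "Spam"
```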

I feel like I might just be looking in the wrong place, and maybe there's already a better solution for something like this out there. Perhaps I'm searching for the wrong words to find something a little more useful.

Solution

Don't try to roll your own solution for this; it's much more complex than you would expect. It is in fact one of those things, like encryption, where it is a much better idea to farm it out to someone or something that is really good at it. Here is some background for you.

Levenshtein distance is certainly a good thing to be aware of (you never know when a similarity metric will come in handy), but it is not the right thing to use for this particular problem.

A Bayesian classifier is much closer to what you're after. In fact, spam detection is pretty much the canonical example of a problem where a naive Bayesian classifier can do a tremendous job. Having said that, you'd have to find a large collection of messages that have already been classified as spam and non-spam, and that are similar to the kinds of messages you get on your site. You would then need to train your classifier, measure its performance, tweak it, and make sure you don't overfit it. While Classifier::Bayes is a decent basic implementation, it won't give you much support for any of that. In fact, Ruby suffers from a lack of good natural language processing libraries; there is nothing in Ruby comparable to Python's NLTK.
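As a rough illustration of the train-and-measure step (not a production setup), here is one way to hold out part of a labelled message set for evaluation with Classifier::Bayes. The labelled array is assumed to be data you've collected and tagged yourself:

```ruby
require "classifier"

# `labelled` is assumed: an array of [text, "Spam" or "Ham"] pairs from your own site.
shuffled  = labelled.shuffle
cut       = (shuffled.size * 0.8).floor
train_set = shuffled[0...cut]
test_set  = shuffled[cut..-1]

bayes = Classifier::Bayes.new("Spam", "Ham")
train_set.each { |text, label| bayes.train(label, text) }

# Measure on messages the classifier has never seen, not on the training data.
correct = test_set.count { |text, label| bayes.classify(text) == label }
puts "Accuracy on held-out messages: #{(100.0 * correct / test_set.size).round(1)}%"
```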

Having said all of that, services like Akismet will certainly have a Bayesian classifier as one of the tools they use to determine whether what you send them is spam. This classifier will likely be much more sophisticated than anything you can build yourself, if for no other reason than the fact that they have access to so much more data. They likely also have other kinds of classifiers and algorithms in play; this is their core business, after all.

Long story short, if I were you I would give something like Akismet another look. If you build a facility into your site where you or your users can flag messages as spam (for example via rakismet's spam! method), you'll be able to send this data to Akismet, and the service should learn pretty quickly that a particular kind of message is spammy. So even if Akismet doesn't pick up on your users' similar spammy messages straight away, after you flag a couple of them the rest should be caught automatically. I would concentrate my efforts on experimenting in this direction rather than trying to roll my own solution.
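Below is a minimal sketch of what that feedback loop might look like with rakismet. The column names (body, sender_name, sender_email) are assumptions about your schema, and rakismet itself still needs an Akismet API key configured:

```ruby
# app/models/message.rb -- sketch only; adjust attribute names to your schema.
class Message < ActiveRecord::Base
  include Rakismet::Model

  # Map the fields Akismet expects onto your own columns (names assumed here).
  rakismet_attrs author:       :sender_name,
                 author_email: :sender_email,
                 content:      :body
end

# On the way in, you can check a new message:
#   message.spam?   # asks Akismet whether this looks like spam
#
# When you or a trusted user flags a message, feed that back:
#   message.spam!   # report a missed spam message to Akismet
#   message.ham!    # correct a false positive
```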

Licensed under: CC-BY-SA with attribution