Question

The ways I can think of are:

  1. Measure the time between actions.
  2. Compare the posts' content (if they're too similar to each other) or, better yet, only the posted links.
  3. Check the distribution of activity over time (if a user posts, say, once every hour for a week straight, we either have a superman or a bot here).
  4. Expect some site-specific activity: on Stack Overflow, for example, I would expect users to click their user-name link (top middle) to see their new answers, comments, questions, etc.
  5. (added by chakrit) Number of links in a post.
  6. Not a heuristic: use some async JS for the user login. (It just makes life a bit harder for the bot programmer.)
  7. (added by Alekc) Not a heuristic: user-agent values.
  8. And how could I forget Google's approach (mentioned below by Will Hartung): give users the ability to mark someone as spam; enough spam votes means this is a spam user. (Calculating how many votes is enough is the real work here.)
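Heuristics 1 and 3 can be combined into one check: flag a user whose posts come too fast or at suspiciously regular intervals. A minimal sketch, with made-up thresholds you would tune against real traffic:

```python
from statistics import pstdev

# Hypothetical thresholds -- tune these against your own traffic.
MIN_SECONDS_BETWEEN_POSTS = 5
MAX_REGULARITY_STDEV = 2.0  # near-constant gaps look scripted

def looks_scripted(post_times):
    """Flag a user whose post timestamps (seconds, ascending) are either
    too fast (heuristic 1) or too evenly spaced (heuristic 3)."""
    gaps = [b - a for a, b in zip(post_times, post_times[1:])]
    if not gaps:
        return False
    if min(gaps) < MIN_SECONDS_BETWEEN_POSTS:
        return True
    # A human posting "once every hour for a week" is unlikely; a tiny
    # spread across many gaps suggests a scheduler, not a person.
    return len(gaps) >= 10 and pstdev(gaps) < MAX_REGULARITY_STDEV
```

Humans are bursty; schedulers are metronomic, which is why the standard deviation of the gaps is a useful signal on its own.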

Any more ideas?


Solution

I might be overestimating the intelligence of bot creators, but number 6 is completely useless against any semi-decent one. Using the C# browser control to build a bot pretty much renders it moot, and from what I've seen that's a fairly common approach for this type of software.

Validating the user-agent is pretty much useless too; all of the blog spam I used to get came from bots presenting themselves as valid web browsers.

I used to get a lot of blog spam; I would literally be deleting hundreds of comments a day. I added reCAPTCHA and now I might get one a month.

If you really want to build something like this, I would try the following:

Users start off with no ability to post a URL.

After X posts have been analyzed in relation to the other posts in the thread, give them the ability to post URLs.

The user's activity on the site, the post quality, and whatever other factors you deem necessary then become a reputation for that user's IP.

Then, based on the reputation of that IP and of the other IPs on the same subnet, you can make whatever other decisions you want.
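The staged-trust idea above could be sketched roughly like this; the class, threshold, and scoring are all illustrative, not a real API:

```python
# Assumption: ten analyzed posts before URL posting unlocks.
POSTS_BEFORE_URLS = 10

class IpReputation:
    """Per-IP reputation: post count plus whatever quality score
    your analysis of the posts produces."""
    def __init__(self):
        self.posts = 0
        self.quality_score = 0.0

    def record_post(self, quality):
        self.posts += 1
        self.quality_score += quality

    def may_post_urls(self):
        # New users start off with no ability to post a URL.
        return self.posts >= POSTS_BEFORE_URLS and self.quality_score > 0

def subnet_key(ip):
    """Group IPv4 addresses by /24 so decisions can also consider
    the other IPs on the same subnet."""
    return ".".join(ip.split(".")[:3])
```

Keying the reputation store by `subnet_key(ip)` as well as by the full address lets one bad host lower the starting trust of its neighbors.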

That was just the first thing that came to mind. Hope it helps.

OTHER TIPS

  • The number of links in a post.

I believe I've read somewhere that Akismet uses the number of links as one of its major heuristics.

And most of the spam comments on my blog contain 10+ links.
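The link-count heuristic is a one-liner; the cutoff below is a guess (the spam described above carried 10+ links, so anything over a handful is suspect):

```python
import re

# Hypothetical cutoff -- the spam above had 10+ links per comment.
MAX_LINKS = 5

_LINK_RE = re.compile(r"https?://", re.IGNORECASE)

def too_many_links(body):
    """Count URLs in a post body and flag link-stuffed comments."""
    return len(_LINK_RE.findall(body)) > MAX_LINKS
```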

Speaking of which, you might want to check out the Akismet API itself; it is extremely effective.

How about searching for spam-related keywords in the post body?
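A naive version of that keyword search might look like this; the word list is purely illustrative, and in practice you would grow it from the spam you actually receive:

```python
# Illustrative keyword list -- maintain this from real spam samples.
SPAM_KEYWORDS = {"viagra", "casino", "payday", "replica"}

def keyword_hits(body):
    """Return how many known spam keywords appear in the post body."""
    words = set(body.lower().split())
    return len(words & SPAM_KEYWORDS)
```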

Not a heuristic but an effective approach: you can also keep up to date with the stats published by StopForumSpam via their APIs.
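A lookup against StopForumSpam is just an HTTP GET. This sketch only builds the query URL (the parameter names match their documented API at the time of writing, but verify against the current docs before relying on it); sending the request is left to your HTTP client of choice:

```python
from urllib.parse import urlencode

# Endpoint per the StopForumSpam docs; confirm before relying on it.
SFS_API = "http://api.stopforumspam.org/api"

def sfs_query_url(ip=None, email=None, username=None):
    """Build a StopForumSpam lookup URL for any combination of
    IP address, email, and username; 'json' requests a JSON reply."""
    params = {k: v for k, v in
              (("ip", ip), ("email", email), ("username", username)) if v}
    params["json"] = ""
    return SFS_API + "?" + urlencode(params)
```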

Time between page visits is common I believe.

I need to add a comment section to my personal site and am thinking of asking people to give me their email address; I'll email them a "publish comment" link.

You might want to check whether they've come from a spam-blacklisted IP address (see http://www.spamhaus.org/).
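DNS blacklists like Spamhaus are queried by reversing the IP's octets and appending the list's zone; if the name resolves, the address is listed. This sketch only builds the query name (do check Spamhaus's usage terms before querying their zones in production):

```python
import ipaddress

def dnsbl_hostname(ip, zone="zen.spamhaus.org"):
    """Build the reversed-octet DNSBL query name for an IPv4 address.
    A successful socket.gethostbyname() on this name means the
    address is listed in that zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone
```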

There is another answer that suggests using Akismet for detecting spam, which I completely endorse.

However, they are not the only player on the block.

There is TypePad AntiSpam, which uses the same heuristics as Akismet, as well as the same API (just a different URL and API key; the structure of the calls is the same). It is safe to say they take pretty much the same approach as Akismet.
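The shared call shape is a single POST per comment. This sketch only assembles the URL and form body in the Akismet style (field names follow their docs at the time of writing; confirm against the current API, and swap the host for TypePad AntiSpam):

```python
from urllib.parse import urlencode

def comment_check_request(api_key, blog, user_ip, user_agent, content,
                          host="rest.akismet.com"):
    """Build the URL and form body for an Akismet-style comment-check
    POST. TypePad AntiSpam used the same call with a different host."""
    url = "https://%s.%s/1.1/comment-check" % (api_key, host)
    body = urlencode({
        "blog": blog,                 # your site's URL
        "user_ip": user_ip,           # commenter's IP
        "user_agent": user_agent,     # commenter's user-agent
        "comment_content": content,
    })
    return url, body
```

The service replies with a plain "true" (spam) or "false" (ham) body, so wiring this into any HTTP client is trivial.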

You might also want to check out Project Honey Pot. From what I can tell, it does a lookup based on the user's IP address and tells you if it is a known malicious IP (a harvester or the like).

Finally, you can check LinkSleeve which approaches comment spam with what it claims to be a different way. Basically, it checks the links that are being linked to in comments, and based on where the links are going to, makes a determination.

Don't forget the ultimate heuristic: The "Report Spam" button that users can click. If nothing else, this gives you as administrator a chance to update your rule base for stuff that may be slipping through. Of course, you can simply delete the offending post and user right away as well.
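The hard part of the "Report Spam" button, as the question notes, is picking the threshold. A minimal vote tracker, with a made-up threshold and one vote per reporter so a single user can't flag someone alone:

```python
from collections import Counter

# Assumed threshold -- "calculating what is enough" is the hard part.
SPAM_VOTE_THRESHOLD = 5

_votes = Counter()

def report_spam(user_id, reporter_id):
    """Record one spam vote (at most one per reporter) and return True
    once the reported user should be queued for review or removal."""
    _votes[(user_id, reporter_id)] = 1
    distinct_reporters = sum(1 for (u, _r) in _votes if u == user_id)
    return distinct_reporters >= SPAM_VOTE_THRESHOLD
```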

I have some doubts about point 4; anyway, I would also add the user-agent. It's pretty easy to fake, but in my experience about 90% of bots use Perl as the UA.

I am sure there is a web service of some kind from which you can get a list of top SEO keywords; check the content for those keywords, and if the content is too rich in them, suspect it of being spam.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow