Good algorithm to check if post frequency is spam

https://stackoverflow.com/questions/7832769

27-10-2019

Question

I have a site where people can post text. Each post is stored in a database with the IP of the poster and the time of the post. I want to be able to display a reCAPTCHA if I can determine that the poster is a bot, spammer, etc.

What is a good algorithm to do this? The simplest choice is to check whether the number of posts in a predetermined time period, say one minute, exceeds a chosen limit, say 10. However, this breaks down when multiple people post from behind the same IP, and a bot can evade it by posting at random intervals longer than the time period, or by staying under the limit within each period.
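To make the simple check concrete, here is a minimal Python sketch of it. The names `needs_captcha`, `WINDOW_SECONDS` and `MAX_POSTS` are made up for this example, and the in-memory dictionary only stands in for the database table of posts.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # the "one minute" period from the example above
MAX_POSTS = 10       # the "10 posts" limit from the example above

# Maps each IP to the timestamps of its recent posts
# (an in-memory stand-in for the posts table).
_recent_posts = defaultdict(deque)

def needs_captcha(ip, now=None):
    """Record a post from `ip` and return True if it is over the limit."""
    now = time.time() if now is None else now
    stamps = _recent_posts[ip]
    # Forget posts that have fallen out of the time window.
    while stamps and now - stamps[0] >= WINDOW_SECONDS:
        stamps.popleft()
    stamps.append(now)
    return len(stamps) > MAX_POSTS
```

It has exactly the weaknesses described: everyone behind one IP shares a single counter, and a bot that posts slowly enough never trips it.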

Obviously there is no "correct" answer. Some algorithms are better than others, however, and I am just trying to find the best one.

Solution

You can have a limit-based approach, and make good use of website analytics.

There must be limits to how many times an IP will post things in a single context. For example, on a StackExchange question (the context), my IP address will in most cases post a single answer (not counting comments). More than one answer is uncommon, and hence suspicious. In other contexts, such as StackExchange comments, the normal frequency can be up to a few posts.
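A per-context limit along these lines could be sketched in Python as follows. This is only an illustration: the limit values, the `CONTEXT_LIMITS` table and the `is_suspicious` function are assumptions of mine, and the real numbers should come from what is normal on your own site.

```python
from collections import Counter

# Hypothetical limits per kind of context; tune them from your own data.
CONTEXT_LIMITS = {"answer": 1, "comment": 5}

# (ip, context_type, context_id) -> number of posts made so far
_post_counts = Counter()

def is_suspicious(ip, context_type, context_id):
    """Count a post and report whether this IP is over the limit for this context."""
    key = (ip, context_type, context_id)
    _post_counts[key] += 1
    return _post_counts[key] > CONTEXT_LIMITS[context_type]
```

Here the second answer from the same IP on the same question is flagged, while a first answer on a different question is not.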

Then there must be limits on the time a user spends in a single visit. If you are using website analytics, you probably know the average time a user spends on your site. Make the time limit considerably greater than that, or base it on any other criterion you can come up with, including trial and error.

Also, you can use the blogger approach, but with a minor change. Instead of showing a captcha on every post, show it once, when the user logs in or makes their first post. After that, put up a captcha only after some time interval or some number of posts by that user.
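One way to sketch that captcha schedule in Python is shown below. The class `CaptchaPolicy`, its method names and both threshold values are hypothetical and chosen only for illustration. State is kept in memory here, whereas a real site would keep it in the session or the database.

```python
import time

CAPTCHA_EVERY_N_POSTS = 20       # assumed value, tune as needed
CAPTCHA_EVERY_SECONDS = 30 * 60  # assumed value, tune as needed

class CaptchaPolicy:
    """Tracks, per user, when a captcha was last solved."""

    def __init__(self):
        # user_id -> [time the last captcha was solved, posts since then]
        self._state = {}

    def should_challenge(self, user_id, now=None):
        now = time.time() if now is None else now
        state = self._state.get(user_id)
        if state is None:
            return True  # login or first post: always challenge
        last_solved, posts_since = state
        return (posts_since >= CAPTCHA_EVERY_N_POSTS
                or now - last_solved >= CAPTCHA_EVERY_SECONDS)

    def captcha_solved(self, user_id, now=None):
        now = time.time() if now is None else now
        self._state[user_id] = [now, 0]

    def post_accepted(self, user_id):
        # Call only after the user has solved at least one captcha.
        self._state[user_id][1] += 1
```

Call `should_challenge` before accepting a post, `captcha_solved` when the user passes the captcha, and `post_accepted` once the post is stored.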

Licensed under: CC-BY-SA with attribution
