Question

I'm doing very rudimentary tracking of page views by logging URL, referral codes, sessions, times, etc., but I'm finding it's getting bombarded with robots (Google, Yahoo, etc.). What is an effective way to filter out these hits, or to avoid logging them at all?

I've experimented with robot IP lists, etc., but this isn't foolproof.

Is there some kind of robots.txt, .htaccess, server-side PHP, JavaScript, or other method that can "trick" robots or ignore non-human interaction?

Solution

Just to add: one technique you can employ within your interface is to use JavaScript to encapsulate the actions that lead to user-interaction view/counter increments. For a very rudimentary example, a robot will not (cannot) follow:

<a href="javascript:viewItem(4)">Chicken Farms</a>

// A robot that does not execute JavaScript never triggers this
// navigation, so only human clicks produce the tracked request.
function viewItem(id)
{
    // Note the explicit protocol; without it the browser treats the URL as relative.
    window.location.href = 'http://www.example.com/items?id=' + id + '&from=userclick';
}

Each click handled this way yields a request such as

www.example.com/items?id=4&from=userclick

That would help you reliably track how many times something is 'clicked', but it has obvious drawbacks, and of course it really depends on what you're trying to achieve.
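On the server side, the matching logging step could then count only requests that carry the from=userclick marker. Here is a minimal PHP sketch of that idea; the page_views table, its columns, and the PDO connection details are assumptions for illustration:

<?php
// Count a view only when the request carries the marker added by the
// JavaScript click handler; bots fetching the bare URL are skipped.
if (isset($_GET['from']) && $_GET['from'] === 'userclick') {
    // Hypothetical connection and table; adjust to your own schema.
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'password');
    $stmt = $pdo->prepare('INSERT INTO page_views (item_id, viewed_at) VALUES (:id, NOW())');
    $stmt->execute(array(':id' => (int) $_GET['id']));
}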

OTHER TIPS

It depends on what you want to achieve. If you want search bots to stop visiting certain paths/pages, you can list them in robots.txt. The majority of well-behaved bots will stop hitting them.
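For example, a robots.txt file at the site root like the following asks all crawlers to stay away from a given path (the /items/ path here is just a placeholder):

User-agent: *
Disallow: /items/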

If you want bots to index these paths but you don't want to see them in your reports, then you need to implement some filtering logic. For example, all major bots send a distinctive user-agent string (such as Googlebot/2.1); you can use these strings to filter those hits out of your reporting.
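A minimal PHP sketch of that filtering logic, assuming you write the log entries yourself; the substring list is illustrative, not exhaustive:

<?php
// Skip logging when the user-agent matches a known crawler signature.
// This list is illustrative; real deployments maintain a longer one.
function isBot($userAgent)
{
    $signatures = array('Googlebot', 'bingbot', 'Slurp', 'DuckDuckBot', 'Baiduspider');
    foreach ($signatures as $signature) {
        if (stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

if (!isBot(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '')) {
    // ... record the page view as usual ...
}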

Well, the robots will all use a specific user-agent, so you can just disregard those requests.

But also, if you just use a robots.txt and deny them from visiting, that will work too.

Don't reinvent the wheel!

Any current statistics tool filters out robot requests. You can install AWStats (open source) even on shared hosting. If you don't want to install software on your server, you can use Google Analytics by adding a short script at the end of your pages. Both solutions are very good. That way, you only have to log your errors yourself (500, 404, and 403 are enough).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow