Question

I am building stats for my users and don't want visits from bots to be counted.

Right now I have a basic PHP script with MySQL that increments a counter by 1 each time the page is called.

But visits from bots are also added to the count.

Can anyone think of a way to exclude them?

It's mainly just the major ones that mess things up: Google, Yahoo, MSN, etc.


Solution

You should filter by user-agent strings. A list of about 300 common user agents sent by bots is available at http://www.robotstxt.org/db.html. Checking the visitor's user agent against that list and skipping your SQL statement when it matches should solve your problem for all practical purposes.
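As a rough illustration of that idea, here is a minimal sketch. The signature list is a short hypothetical stand-in for a real list, and the function names `isListedBot` and `countVisit` are made up for this example. The increment is passed in as a callable so the sketch does not assume anything about your table or database layer.

```php
<?php
// Hypothetical, deliberately short list of substrings found in bot user agents.
// A real list would be built from a maintained source such as the one linked above.
const BOT_SIGNATURES = ['googlebot', 'slurp', 'msnbot', 'bingbot'];

/** Returns true when the user agent contains any of the given bot signatures. */
function isListedBot(string $userAgent, array $signatures = BOT_SIGNATURES): bool
{
    foreach ($signatures as $signature) {
        // stripos() is case-insensitive, so "Googlebot" matches "googlebot".
        if (stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

/** Runs $increment (e.g. your SQL UPDATE) only for visitors not on the bot list. */
function countVisit(string $userAgent, callable $increment): bool
{
    if (isListedBot($userAgent)) {
        return false; // bot: skip the SQL statement entirely
    }
    $increment();
    return true;
}
```

On the page itself you would pass the request's user agent and wrap your existing counter query in the callable.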

If you don't want the search engines to even reach the page, use a basic robots.txt file to block them.

OTHER TIPS

You can check the user agent string: empty strings, or strings containing 'robot', 'spider', 'crawler' or 'curl', are likely to be robots.

$isBot = preg_match('/robot|spider|crawler|curl|^$/i', $_SERVER['HTTP_USER_AGENT']) === 1;
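Note that the pattern above does not match the major crawlers named in the question: "Googlebot" and "msnbot" contain "bot" but not "robot", and Yahoo's crawler identifies itself as "Slurp". A broader variant might look like the sketch below. The function name `looksLikeBot` is made up for this example, and the extra tokens are my own assumption about what is worth matching; a short token like "bot" could also misclassify a real browser whose user agent happens to contain it.

```php
<?php
/**
 * Broader variant of the pattern above. "bot" also covers "robot",
 * and "crawl" also covers "crawler". "slurp" is the token Yahoo's crawler uses.
 */
function looksLikeBot(?string $userAgent): bool
{
    // Treat a missing header (null) the same as an empty one.
    return preg_match('/bot|spider|crawl|slurp|curl|^$/i', $userAgent ?? '') === 1;
}
```

Accepting `null` means the function can be called with `$_SERVER['HTTP_USER_AGENT'] ?? null` without triggering a notice when the header is absent.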

We have a similar use case, and one option we've recently found quite helpful is the UASParser class from user-agent-string.info.

It's a PHP class which pulls the latest set of user agent string definitions and caches them locally. The class can be configured to pull the definitions as often or as rarely as you deem fit. Automatically fetching them like this means that you don't have to keep on top of the various changes to bot user agents or new ones coming on the market, although you are relying on UAS.info to do this accurately.

When the class is called, it parses the current visitor's user agent and returns an associative array breaking out the constituent parts, e.g.

Array
(
    [typ] => browser
    [ua_family] => Firefox
    [ua_name] => Firefox 3.0.8
    [ua_url] => http://www.mozilla.org/products/firefox/
    [ua_company] => Mozilla Foundation
    ........
    [os_company] => Microsoft Corporation.
    [os_company_url] => http://www.microsoft.com/
    [os_icon] => windowsxp.png
)

The field typ is set to browser when the UA is identified as likely belonging to a human visitor, in which case you can update your stats.
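A minimal sketch of that check follows. It only assumes an array shaped like the sample output above; how you obtain that array from the class is not shown here, and the function name `shouldCountVisit` is made up for this example.

```php
<?php
/**
 * Decides whether to count a visit, given a parsed user-agent array
 * shaped like the sample output above (only the 'typ' key is used).
 */
function shouldCountVisit(array $parsed): bool
{
    // Compare case-insensitively in case the parser capitalises the value.
    return strcasecmp((string) ($parsed['typ'] ?? ''), 'browser') === 0;
}
```

A missing `typ` key is treated as "do not count", which errs on the side of under-counting rather than letting unidentified agents into the stats.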

A couple of caveats:

  • You're relying on UAS.info to keep its user agent strings accurate and up to date
  • Bots like Google and Yahoo declare themselves in their user agent strings, but this method will still count visits from bots pretending to be human visitors (sending spoofed UAs)
  • As @amdfan mentioned above, blocking bots via robots.txt should stop most of them from reaching your page. If you need the content to be indexed but don't want those visits to increment your stats, the robots.txt method won't be a realistic option

Check the user agent before incrementing the page view count, but remember that this can be spoofed. PHP exposes the user agent in $_SERVER['HTTP_USER_AGENT'], assuming that the web server provides it with this information. More information about $_SERVER can be found at http://www.php.net/manual/en/reserved.variables.server.php.
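Because the header may be absent, it can help to read it defensively. A small sketch, where the helper name `userAgentFrom` is made up for this example:

```php
<?php
/**
 * Returns the request's user agent, or an empty string when the
 * web server did not provide one.
 */
function userAgentFrom(array $server): string
{
    $userAgent = $server['HTTP_USER_AGENT'] ?? '';
    return is_string($userAgent) ? trim($userAgent) : '';
}
```

Called as `userAgentFrom($_SERVER)`, this always yields a string, so later checks do not need to handle a missing key.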

You can find a list of user agents at http://www.user-agents.org; Googling will also provide the names of those belonging to the major providers. A third possible source would be your web server's access logs, if you can aggregate them.
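For the access-log route, a rough sketch of the aggregation is below. It assumes your server writes the common "combined" log format, in which the user agent is the last double-quoted field on each line; check your own log configuration before relying on it. The function name `tallyUserAgents` is made up for this example.

```php
<?php
/**
 * Tallies user agents from access-log lines, assuming the "combined"
 * format where the user agent is the last double-quoted field on the line.
 */
function tallyUserAgents(iterable $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        // Capture the final "..." field at the end of the line.
        if (preg_match('/"([^"]*)"\s*$/', $line, $match) !== 1) {
            continue; // line is not in the assumed format
        }
        $counts[$match[1]] = ($counts[$match[1]] ?? 0) + 1;
    }
    arsort($counts); // most frequent user agents first
    return $counts;
}
```

The lines can come from `file($path)` or any other iterable of strings. Agents that appear very often without ever looking like a browser are candidates for your bot list.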

Have you tried identifying them by their user-agent information? A simple Google search should give you the user agents used by Google etc.

This, of course, is not foolproof, but most crawlers run by major companies supply a distinct user agent.

EDIT: This assumes you do not want to restrict the bots' access, but just want to keep their visits out of your statistics.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow