Question

How can one detect a crawler / spider using PHP?

I'm currently working on a project where I need to keep track of each crawler's visit.
I know that you should use HTTP_USER_AGENT but I'm not really sure how to format the code for this purpose and i know that the USER AGENT can be changed very easy so i would also like to know if it is possible to add some more parameters to avoid spoofing?

Sample code of what i'm trying to do..

<?php
$user_agent = $_SERVER['HTTP_USER_AGENT'];
if (strpos( $user_agent, 'Google') !== false)
{
echo "Googlebot is here";
}
?>

Thank you

Was it helpful?

Solution

According to Verifying Googlebot:

You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

For example:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer
crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1
Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).

You can do a reverse DNS lookup:

function validateGoogleBotIP($ip) {
    $hostname = gethostbyaddr($ip); //"crawl-66-249-66-1.googlebot.com"

    return preg_match('/\.google(bot)?\.com$/i', $hostname);
}

if (strpos($_SERVER['HTTP_USER_AGENT'], 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}

OTHER TIPS

100% Working on my website to detect robots, crawlers, spiders and copiers.

function isBotDetected() {

    if ( preg_match('/abacho|accona|AddThis|AdsBot|ahoy|AhrefsBot|AISearchBot|alexa|altavista|anthill|appie|applebot|arale|araneo|AraybOt|ariadne|arks|aspseek|ATN_Worldwide|Atomz|baiduspider|baidu|bbot|bingbot|bing|Bjaaland|BlackWidow|BotLink|bot|boxseabot|bspider|calif|CCBot|ChinaClaw|christcrawler|CMC\/0\.01|combine|confuzzledbot|contaxe|CoolBot|cosmos|crawler|crawlpaper|crawl|curl|cusco|cyberspyder|cydralspider|dataprovider|digger|DIIbot|DotBot|downloadexpress|DragonBot|DuckDuckBot|dwcp|EasouSpider|ebiness|ecollector|elfinbot|esculapio|ESI|esther|eStyle|Ezooms|facebookexternalhit|facebook|facebot|fastcrawler|FatBot|FDSE|FELIX IDE|fetch|fido|find|Firefly|fouineur|Freecrawl|froogle|gammaSpider|gazz|gcreep|geona|Getterrobo-Plus|get|girafabot|golem|googlebot|\-google|grabber|GrabNet|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|HTTrack|ia_archiver|iajabot|IDBot|Informant|InfoSeek|InfoSpiders|INGRID\/0\.1|inktomi|inspectorwww|Internet Cruiser Robot|irobot|Iron33|JBot|jcrawler|Jeeves|jobo|KDD\-Explorer|KIT\-Fireball|ko_yappo_robot|label\-grabber|larbin|legs|libwww-perl|linkedin|Linkidator|linkwalker|Lockon|logo_gif_crawler|Lycos|m2e|majesticsEO|marvin|mattie|mediafox|mediapartners|MerzScope|MindCrawler|MJ12bot|mod_pagespeed|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|NationalDirectory|naverbot|NEC\-MeshExplorer|NetcraftSurveyAgent|NetScoop|NetSeer|newscan\-online|nil|none|Nutch|ObjectsSearch|Occam|openstat.ru\/Bot|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pingdom|pinterest|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|rambler|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Scrubby|Search\-AU|searchprocess|search|SemrushBot|Senrigan|seznambot|Shagseeker|sharp\-info\-agent|sift|SimBot|Site Valet|SiteSucker|skymob|SLCrawler\/2\.0|slurp|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|spider|suke|tach_bw|TechBOT|TechnoratiSnoop|templeton|teoma|titin|topiclink|twitterbot|twitter|UdmSearch|Ukonline|UnwindFetchor|URL_Spider_SQL|urlck|urlresolver|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|wapspider|WebBandit\/1\.0|webcatcher|WebCopier|WebFindBot|WebLeacher|WebMechanic|WebMoose|webquest|webreaper|webspider|webs|WebWalker|WebZip|wget|whowhere|winona|wlm|WOLP|woriobot|WWWC|XGET|xing|yahoo|YandexBot|YandexMobileBot|yandex|yeti|Zeus/i', $_SERVER['HTTP_USER_AGENT'])
    ) {
        return true; // 'Above given bots detected'
    }

    return false;

} // End :: isBotDetected()

To properly validate if a visitor is from a search engine you need more than just checking the User-Agent which can be easily spoofed.

The proper method is to look up the hostname of the IP and do a quick check if it matches any of the hostnames we know search engines crawlers use.

If the hostname matches one of the known crawlers, THEN you look up the IP of the hostname and see if the two are a match. If one of the steps fails, you have a fake search engine crawler visiting.

The following function accepts an IP and follows previously mentioned steps. It identifies Baidu, Bing, Google, Yahoo, and Yandex.

 /**
 * Validate a crawlers IP against the hostname 
 * Warning - str_ends_with() requires PHP 8
 *
 * @param   mixed   $ip 
 * @return  boolean
 */
function validate_crawler_ip( $testip ) {
   $hostname = strtolower( gethostbyaddr( $testip ) );
   $valid_host_names = array(
      '.crawl.baidu.com',
      '.crawl.baidu.jp',
      '.google.com',
      '.googlebot.com',
      '.crawl.yahoo.net',
      '.yandex.ru',
      '.yandex.net',
      '.yandex.com',
      '.search.msn.com',
   );
   $valid_ip = false;
   foreach ( $valid_host_names as $valid_host ) {
     // Using string_ends_with() to make sure the match is in the -end- of the hostname (to prevent fake matches)
     if ( str_ends_with( $hostname, $valid_host ) ) { // PHP 8 function
      $returned_ip = gethostbyname( $hostname );        
      if ( $returned_ip === $testip ) {
        // The looked up IP from the host matches the incoming IP - we have validated!
        return true;
      }
    }
  }
  // No match - not valid crawler
  return false;
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top