Googlebot: unexplained 32-character hexadecimal paths causing more than 20,000 404 errors per day

StackOverflow https://stackoverflow.com/questions/13666930

Question

I have a very interesting problem that I am failing to explain.

Every 2 to 6 seconds Googlebot (I looked up its IP with the host command, and it is the real thing) requests a page on our site (running PHP, Apache, MongoDB) that does not exist, so it returns a 404. No other robot or human has ever requested a page like this. Just Googlebot.

The requests each look something like this:

/2de4f853c2853807b2e72387aa8928a4

/ea5700c343d1a9798bc554af7c1a330e

/e5aafa102d54ba7517703336846cc019
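
One way to measure the problem is to filter the request log for paths of this shape and count them per source address. The following is a minimal Python sketch, not the site's actual logging code: the (remote_addr, path) pair format, the function name count_hex_paths, and the sample addresses (taken from a documentation range) are assumptions for illustration.

```python
import re
from collections import Counter

# Assumption: the suspicious paths are exactly "/" followed by 32 lowercase
# hexadecimal characters, as in the three examples above.
HEX32_PATH = re.compile(r"/[0-9a-f]{32}")


def count_hex_paths(requests):
    """Count requests whose path is a bare 32-character hex string.

    `requests` is an iterable of (remote_addr, path) pairs; this shape is
    hypothetical and would be adapted to the real log format.
    """
    hits = Counter()
    for remote_addr, path in requests:
        if HEX32_PATH.fullmatch(path):
            hits[remote_addr] += 1
    return hits


# Hypothetical sample data; the addresses come from a documentation range.
sample = [
    ("192.0.2.10", "/2de4f853c2853807b2e72387aa8928a4"),
    ("192.0.2.10", "/ea5700c343d1a9798bc554af7c1a330e"),
    ("192.0.2.20", "/e5aafa102d54ba7517703336846cc019"),
    ("192.0.2.20", "/about"),
]
print(count_hex_paths(sample))
```

A count per address makes it easy to see whether the requests come from a handful of sources or from many.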

Our code does not use any 32-character strings, and there are no links anything like that, internal or external to our site. We use CodeIgniter, so at first I thought it was the default session_id. I have checked, and it is not.

Has anyone ever seen anything like this? Our website uses history.push on some pages. Could this cause it? Just an idea.

Raw Data of an example request:

array (
  'date' => '2012-12-01',
  'time' => '10:01:33 PM',
  'additional_data' => 
    array (
      'server_vars' => 
        array (
          'REDIRECT_STATUS' => '200',
          'HTTP_HOST' => 'www.xxxxxxx.com',
          'HTTP_ACCEPT' => '*/*',
          'HTTP_ACCEPT_ENCODING' => 'gzip,deflate',
          'HTTP_FROM' => 'googlebot(at)googlebot.com',
          'HTTP_USER_AGENT' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
          'HTTP_X_FORWARDED_FOR' => 'xxxxxxx',
          'HTTP_X_FORWARDED_PORT' => '80',
          'HTTP_X_FORWARDED_PROTO' => 'http',
          'HTTP_CONNECTION' => 'keep-alive',
          'PATH' => '/sbin:/usr/sbin:/bin:/usr/bin:/home/ec2-user/ec2/bin',
          'SERVER_SIGNATURE' => '<address>Apache/2.2.22 (Amazon) Server at www.xxxxxxx.com Port 80</address>
',
          'SERVER_SOFTWARE' => 'Apache/2.2.22 (Amazon)',
          'SERVER_NAME' => 'www.xxxxxxx.com',
          'SERVER_ADDR' => 'xxxxxxxxxx',
          'SERVER_PORT' => '80',
          'REMOTE_ADDR' => '10.171.147.114',
          'REMOTE_PORT' => '40759',
          'REDIRECT_URL' => '/e5aafa102d54ba7517703336846cc019',
          'GATEWAY_INTERFACE' => 'CGI/1.1',
          'SERVER_PROTOCOL' => 'HTTP/1.1',
          'REQUEST_METHOD' => 'GET',
          'QUERY_STRING' => '',
          'REQUEST_URI' => '/e5aafa102d54ba7517703336846cc019',
          'SCRIPT_NAME' => '/index.php',
          'PATH_INFO' => '/e5aafa102d54ba7517703336846cc019',
          'PATH_TRANSLATED' => 'redirect:/index.php/e5aafa102d54ba7517703336846cc019',
          'PHP_SELF' => '/index.php/e5aafa102d54ba7517703336846cc019',
          'REQUEST_TIME' => 1354428093,
       ),
    'codeigiter_session' => 
      array (
        'session_id' => 'c795e40a279f58d9fbbf7f5501a26787',
        'ip_address' => '10.171.147.114',
        'user_agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
        'last_activity' => 1354428093,
        'user_data' => '',
    ),
  ),
)

What else can I collect to figure this out? It's very strange.


Update: The traffic is coming from two primary IP addresses: 10.171.147.114 and 10.161.46.102.

I have looked these up and they are not Googlebot.

I got the following information from an IP lookup site:

Remember that IP address ranges 10.0.0.0 – 10.255.255.255, 172.16.0.0 – 172.31.255.255, 192.168.0.0 – 192.168.255.255 and 224.0.0.0 - 239.255.255.255 are reserved IP Addresses for private internet use and IP lookup for these will not return any results.
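
Both addresses in the update fall inside 10.0.0.0/8, so they identify a machine on the internal network rather than a host on the public internet. A minimal Python sketch of that check, using the standard ipaddress module, is shown here. The describe helper is a hypothetical name, and 66.249.66.1 is only an illustrative public address. Note that Python classifies 224.0.0.0 to 239.255.255.255 as multicast rather than private.

```python
import ipaddress


def describe(ip):
    """Return a coarse label for an IPv4/IPv6 address string."""
    addr = ipaddress.ip_address(ip)
    if addr.is_multicast:
        return "multicast"
    if addr.is_private:
        return "private"
    return "public"


# The two addresses from the update, plus one illustrative public address.
for ip in ("10.171.147.114", "10.161.46.102", "66.249.66.1"):
    print(ip, describe(ip))
```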

What should or can I do about these requests? What is the point of them? If this is a type of DoS attack, they are doing a very bad job of it.


Solution

To answer this question: the problem was caused by the AWS load balancer's health checks. For some reason AWS is using the Googlebot user agent to perform them on our servers.
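
When a site sits behind a load balancer, REMOTE_ADDR is the balancer's internal address, and the address of whoever connected to the balancer is normally carried in X-Forwarded-For. The sketch below, in Python, shows one hedged way to pick the address worth investigating from a dictionary shaped like the server_vars dump in the question. The function name effective_client_ip and the sample values are assumptions, and trusting the rightmost X-Forwarded-For entry assumes exactly one trusted proxy in front of the server.

```python
import ipaddress


def effective_client_ip(server_vars):
    """Pick the address to investigate for a request.

    Assumption: if REMOTE_ADDR is private, the request came through a
    single trusted proxy that appended the caller's address as the last
    entry of X-Forwarded-For.
    """
    remote = server_vars.get("REMOTE_ADDR", "")
    forwarded = server_vars.get("HTTP_X_FORWARDED_FOR", "")
    try:
        behind_proxy = ipaddress.ip_address(remote).is_private
    except ValueError:
        behind_proxy = False  # missing or malformed REMOTE_ADDR
    if behind_proxy and forwarded:
        # The rightmost entry is the one added by the nearest proxy.
        return forwarded.split(",")[-1].strip()
    return remote


# Hypothetical values; the real X-Forwarded-For was redacted in the question.
example = {
    "REMOTE_ADDR": "10.171.147.114",
    "HTTP_X_FORWARDED_FOR": "198.51.100.4, 66.249.66.1",
}
print(effective_client_ip(example))
```

Logging this value next to REMOTE_ADDR helps separate traffic forwarded on behalf of an outside client from traffic that originates inside the hosting network.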

OTHER TIPS

The first thing to do here is to collect as many IPs as possible and find the answer to 2 questions:

1. Can you group them by networks, like 66.249.66.XXX or 66.249.XXX.XXX? If you can't, this is not a Gbot.
2. What are the countries of these IPs? If you have dozens, this is not a Gbot.
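
The grouping by network can be sketched in Python with the standard ipaddress module. This is only an illustration: the function name group_by_network and the /24 default prefix are assumptions, and the sample list mixes one address from the question with made-up addresses in the 66.249.66.XXX pattern.

```python
import ipaddress
from collections import defaultdict


def group_by_network(ips, prefix=24):
    """Group address strings by the network they fall into."""
    groups = defaultdict(list)
    for ip in ips:
        # strict=False lets us derive the network from a host address.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[str(network)].append(ip)
    return dict(groups)


# Illustrative input only.
collected = ["66.249.66.1", "66.249.66.7", "10.171.147.114"]
for network, members in group_by_network(collected).items():
    print(network, members)
```

A few large groups suggest a single operator; many small, unrelated groups suggest the traffic is not coming from one crawler.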

I don't think this is Googlebot, because it does not tend to crawl a site that doesn't even have a sitemap at this frequency (except for some special cases, like news sites).

Refer to

http://support.google.com/webmasters/bin/answer.py?hl=en&answer=80553

to learn how to recognize a Gbot. Try some online Googlebot IP lists. They may be outdated, but they still give you information about address clusters. Moreover, Googlebot IPs are easily grouped by network.

You can't trust HTTP_USER_AGENT, because a third party can easily forge it.
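
Since the user agent can be forged, the usual check is DNS-based: do a reverse lookup on the address, confirm the host name belongs to a Google domain, then do a forward lookup on that name and confirm it resolves back to the same address. Here is a minimal Python sketch of that idea. The function name, the injectable lookup parameters (added so the logic can be tried without network access), and the suffix list are assumptions, and the host names in the demo are made up.

```python
import socket

# Assumption: host names of genuine crawlers end in one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Reverse-then-forward DNS check for a claimed Googlebot address."""
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False
    # The name must resolve back to the address we started from.
    return ip in addresses


# Offline demo with stub resolvers (made-up DNS answers).
def fake_reverse(ip):
    return ("crawl-66-249-66-1.googlebot.com", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["66.249.66.1"])


print(is_verified_googlebot("66.249.66.1", fake_reverse, fake_forward))
```

Called with the default arguments, the function performs real DNS lookups through the socket module.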

I'd say your site is under a separate attack from some network.

I doubt they are trying to guess PHPSESSID by sending this hash. The only reason for PHPSESSID to appear in the URL is when you have configured PHP not to store it in cookies (I think you didn't). It's easier and more natural to send the session ID in cookies, even when attacking.

Check the POST parameters and cookies they are sending. This may give you more information.

Licensed under: CC-BY-SA with attribution