Question

For some reason, when I check Google Webmaster Tools' "Analyze robots.txt" to see which URLs are blocked by our robots.txt file, the results aren't what I expect. Here is a snippet from the beginning of our file:

Sitemap: http://[omitted]/sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: http://[omitted]/Living/books/book-review-not-stupid.aspx
Disallow: http://[omitted]/Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: http://[omitted]/Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx

Anything in the scripts folder is correctly blocked for both Googlebot and Mediapartners-Google. I can see that the two robots are reading the correct directives, because Googlebot reports the scripts as blocked by line 7 while Mediapartners-Google reports them as blocked by line 4. And yet ANY other URL I enter from the disallowed URLs under the second user-agent directive is NOT blocked!

I'm wondering if my comment or my use of absolute URLs is screwing things up...

Any insight is appreciated. Thanks.

Solution

The reason they are ignored is that you have fully qualified URLs in the Disallow entries, which the specification doesn't allow: a Disallow value must be a path relative to the site root, starting with /. (The Sitemap directive is different; it takes a full URL, so that line can stay as it is.) Try the following:

Sitemap: http://[omitted]/sitemap_index.xml

User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
Disallow: /Living/books/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
Disallow: /Living/sportsandrecreation/book-review-running-through-roadblocks-inspirational-stories-of-twenty-courageous-athletic-warriors.aspx
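
If you want to sanity-check the corrected rules outside of Webmaster Tools, Python's standard urllib.robotparser applies the same prefix matching on paths. The snippet below is only a rough sketch: example.com stands in for the omitted domain, and only one of the article paths is kept for brevity.

from urllib import robotparser

# Rough sketch: example.com stands in for the omitted domain; only one
# article path is included for brevity.
rules = """
User-agent: Mediapartners-Google
Disallow: /scripts

User-agent: *
Disallow: /scripts
# list of articles given by the Content group
Disallow: /Living/books/book-review-not-stupid.aspx
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot falls under "User-agent: *", so both paths should now be blocked.
print(rp.can_fetch("Googlebot", "http://example.com/scripts/foo.js"))                            # False
print(rp.can_fetch("Googlebot", "http://example.com/Living/books/book-review-not-stupid.aspx"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/Living/books/unlisted-article.aspx"))        # True
# Mediapartners-Google matches its own group, which only blocks /scripts.
print(rp.can_fetch("Mediapartners-Google", "http://example.com/scripts/foo.js"))                 # False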

As for caching, Google typically re-fetches the robots.txt file about once every 24 hours.

OTHER TIPS

It's the absolute URLs. Disallow rules in robots.txt are only supposed to contain relative URIs (the path portion); the host is inferred from the domain the robots.txt file was fetched from.
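
To make that concrete: a crawler derives the host from wherever it fetched robots.txt, then compares only the path portion of each candidate URL against the Disallow rules (a simple prefix match in the original spec). A tiny sketch, with example.com again standing in for the omitted domain:

from urllib.parse import urlsplit

url = "http://example.com/Living/books/book-review-not-stupid.aspx"
path = urlsplit(url).path
print(path)  # /Living/books/book-review-not-stupid.aspx -- this is what belongs in a Disallow line
print(path.startswith("/Living/books/book-review-not-stupid.aspx"))              # True: the relative rule matches
print(path.startswith("http://example.com/Living/books/book-review-not-stupid.aspx"))  # False: a full-URL rule can never match a path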

It's been up for at least a week, and Google says it was last downloaded 3 hours ago, so I'm sure it's seeing the current version.

Did you recently make this change to your robots.txt file? In my experience, Google seems to cache that file for a really long time.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow