Question

The problem is this: I have some URLs on my system that follow this pattern:

http://foo-editable.mydomain.com/menu1/option2
http://bar-editable.mydomain.com/menu3/option1

I would like to indicate in the robots.txt file that they should not be crawled. However, I'm not sure if this pattern is correct:

User-agent: Googlebot 
Disallow: -editable.mydomain.com/*

Will it work as I expect?


Solution

You can't specify a domain or subdomain from within a robots.txt file. A given robots.txt file only applies to the subdomain it was loaded from. The only way to block some subdomains and not others is to deliver a different robots.txt file for the different subdomains.

For example, in the file http://foo-editable.mydomain.com/robots.txt you would have:

User-agent: Googlebot
Disallow: /

And in http://www.mydomain.com/robots.txt you could have:

User-agent: *
Allow: /

(or you could just not have a robots.txt file on the www subdomain at all)
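If both subdomains are served by the same application, one way to do this is to generate robots.txt dynamically from the Host header. Here is a minimal sketch using Python's standard-library http.server; the "-editable." naming convention comes from the question, while the handler name and port are illustrative assumptions:

# Serve a blocking robots.txt on "-editable." hosts and a permissive one elsewhere.
# Sketch only: hostname check and port are assumptions, not from the question.
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCK_ALL = b"User-agent: Googlebot\nDisallow: /\n"
ALLOW_ALL = b"User-agent: *\nAllow: /\n"

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_response(404)
            self.end_headers()
            return
        host = self.headers.get("Host", "")
        # "-editable." subdomains get the blocking file; everything else allows crawling.
        body = BLOCK_ALL if "-editable." in host else ALLOW_ALL
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), RobotsHandler).serve_forever()

With something like this in place, a request for http://foo-editable.mydomain.com/robots.txt returns the blocking file, while http://www.mydomain.com/robots.txt returns the permissive one.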

If your configuration will not allow you to deliver different robots.txt files for different subdomains, you might look into alternatives such as a robots meta tag or the X-Robots-Tag response header.
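The meta tag approach puts <meta name="robots" content="noindex, nofollow"> in each page's HTML, while the X-Robots-Tag approach attaches the same directives as an HTTP header on every response. Here is a minimal sketch of the header variant, again with Python's standard library; the handler and page body are illustrative assumptions, not from the question:

# Attach an X-Robots-Tag header on "-editable." hosts so crawlers skip those pages.
# Sketch only: the page body and port are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>editable content</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        if "-editable." in self.headers.get("Host", ""):
            # Same effect as <meta name="robots" content="noindex, nofollow"> in the HTML.
            self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8000), NoIndexHandler).serve_forever()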

OTHER TIPS

I think you would have to code it like this:

User-agent: Googlebot
Disallow: /*-editable.mydomain.com/

There's no guarantee that every bot will process the asterisk as a wildcard, but I think Googlebot does.

Licensed under: CC-BY-SA with attribution