Question

I've blocked crawlers from crawling my web root (/var/www/ in my case) with robots.txt. The robots.txt in /var/www/ contains the line: Disallow: /

Now I need one subdirectory of the web root (/var/www/mysite.com) to be crawled. I've added a robots.txt in that directory and added a virtual host in Apache so that mysite.com can be crawled. BUT the crawlers still take the robots.txt from my web root (/var/www) instead of from /var/www/mysite.com.

Thanks in advance for help.


Solution

You can only have one robots.txt per host, and it must go in that host's root directory; crawlers ignore robots.txt files placed in subdirectories.
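For this particular setup, the practical upshot is that robots.txt is resolved per hostname: if mysite.com is served by its own Apache virtual host, a request for http://mysite.com/robots.txt is answered from that host's DocumentRoot, not from /var/www. A sketch of such a virtual host, using standard Apache directives and the paths from the question:

```
<VirtualHost *:80>
    ServerName mysite.com
    DocumentRoot /var/www/mysite.com
    # http://mysite.com/robots.txt is served from
    # /var/www/mysite.com/robots.txt; the file at /var/www/robots.txt
    # only answers for hostnames whose DocumentRoot is /var/www.
</VirtualHost>
```

If the crawlers are still fetching the wrong file, it usually means requests for mysite.com are falling through to the default virtual host.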

More information can be found in the official documentation.

Where to put it

The short answer: in the top-level directory of your web server.

The longer answer:

When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash) and puts "/robots.txt" in its place.

For example, for "http://www.example.com/shop/index.html", it will remove the "/shop/index.html", replace it with "/robots.txt", and end up with "http://www.example.com/robots.txt".
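That stripping rule can be sketched with Python's standard urllib.parse (an illustration of the rule, not what any particular crawler actually runs):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Replace the path, query, and fragment of a URL with /robots.txt."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/shop/index.html"))
# → http://www.example.com/robots.txt
```

Note that the scheme and hostname are kept: each host (and each subdomain) gets its own robots.txt, which is why a file sitting in a subdirectory is never consulted.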

Also from the same page (at the bottom) it gives an example of allowing only a certain webpage:

To exclude all files except one

This is currently a bit awkward, as there is no "Allow" field.

The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:

User-agent: *
Disallow: /~joe/stuff/

Alternatively you can explicitly disallow all disallowed pages:

User-agent: * 
Disallow: /~joe/junk.html 
Disallow: /~joe/foo.html 
Disallow: /~joe/bar.html
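Since that documentation was written, an Allow field has been standardized in RFC 9309 and is honored by major crawlers such as Googlebot and Bingbot, so the "all files except one" case can be written directly. A sketch, assuming a crawler that applies the longest-match precedence rule ("/~joe/index.html" here is a hypothetical path for the one allowed page):

```
User-agent: *
Allow: /~joe/index.html
Disallow: /
```

The more specific (longer) Allow rule wins over the blanket Disallow for that one page. Crawlers that only implement the original standard will ignore the Allow line, so the directory-based approach above remains the most portable.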
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow