Question

I need control over which URLs are allowed to be indexed. To do this, I want to allow Google to index only URLs that are listed in my sitemap(s), and disallow it from indexing anything else.

The easiest solution would be a robots.txt configured to disallow everything:

User-agent: *
Disallow: /

And at the same time allow every URL that is listed in:

Sitemap: sitemap1.xml
Sitemap: sitemap2.xml

Can robots.txt be configured to do this, or is there some other workaround?


Solution

This answer is not specific to robots.txt; it concerns the Robots protocol as a whole. I have used this technique very often in the past, and it works like a charm.

As far as I understand, your site is dynamic, so why not make use of the robots meta tag? As x0n said, a 30 MB robots.txt file would likely create issues both for you and for the crawlers, and appending new lines to a 30 MB file is an I/O headache. Your best bet, in my opinion, is to inject something like this into the pages you don't want indexed:

<META NAME="ROBOTS" CONTENT="NOINDEX" />

The page will still be crawled, but it won't be indexed. You can still submit the sitemaps through a sitemap reference in robots.txt, you don't have to worry about keeping robotted-out pages out of the sitemaps, and the meta tag is supported by all the major search engines (as far as I remember, by Baidu as well).
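
As a rough sketch of what the injection could look like (Python; the set of indexable paths, the helper name, and the example paths are all illustrative assumptions, not part of the original answer):

# Minimal sketch: emit a noindex robots meta tag for any page whose path
# is not listed in the sitemaps. All names and paths here are illustrative.
INDEXABLE_PATHS = {"/", "/products", "/about"}  # e.g. loaded from the sitemaps

def robots_meta_tag(path: str) -> str:
    """Return the robots meta tag to inject into the page's <head>, if any."""
    if path in INDEXABLE_PATHS:
        return ""  # listed in a sitemap: no tag, so the page may be indexed
    return '<META NAME="ROBOTS" CONTENT="NOINDEX" />'

Whatever the templating setup, the point is that the decision is made per page at render time, so the sitemaps remain the single source of truth.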

OTHER TIPS

You will have to add an Allow entry for each element in the sitemap. This is cumbersome, but it's easy to do programmatically with something that reads the sitemap; or, if the sitemap is itself created programmatically, base the robots.txt on the same code.
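
For illustration, here is a minimal sketch of the first approach (Python, standard library only; the file name is a placeholder):

# Minimal sketch: read a sitemap XML file and emit a robots.txt that allows
# exactly the listed URLs and disallows everything else.
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def robots_from_sitemap(sitemap_path: str) -> str:
    tree = ET.parse(sitemap_path)
    lines = ["User-agent: *"]
    for loc in tree.getroot().iter(SITEMAP_NS + "loc"):
        path = urlparse(loc.text.strip()).path or "/"
        # The trailing "$" anchors the rule to the exact path; like Allow
        # itself, it is an extension supported by Google.
        lines.append("Allow: " + path + "$")
    lines.append("Disallow: /")  # everything not explicitly allowed
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(robots_from_sitemap("sitemap1.xml"))

Rerun it whenever the sitemap changes so that robots.txt stays in sync.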

Note that Allow is an extension to the robots.txt protocol and is not supported by all search engines, though Google does support it.

By signing in at http://www.google.com/webmasters/ you can submit your sitemaps directly to Google.
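
Alternatively, sitemaps can be declared in robots.txt itself, which the major search engines pick up automatically (the URLs below are placeholders):

Sitemap: https://www.example.com/sitemap1.xml
Sitemap: https://www.example.com/sitemap2.xml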
