How can robots.txt disallow all URLs except URLs that are in sitemap
27-09-2019
Question
I need to have control over which URLs are allowed to be indexed. To do this I want to allow Google to index only URLs that are listed in my sitemap(s), and disallow Google from indexing anything else.
The easiest solution would be to configure robots.txt to disallow everything:
User-agent: *
Disallow: /
And at the same time allow every URL that is listed in:
Sitemap: sitemap1.xml
Sitemap: sitemap2.xml
Can the robots.txt be configured to do this? Or are there any other workarounds?
Solution
This isn't a robots.txt-specific answer; it's about the Robots protocol as a whole. I used this technique extremely often in the past, and it works like a charm.
As far as I understand, your site is dynamic, so why not make use of the robots meta tag? As x0n said, a 30MB robots.txt file will likely create issues both for you and for the crawlers, and appending new lines to a 30MB file is an I/O headache. Your best bet, in my opinion anyway, is to inject into the pages you don't want indexed something like:
<META NAME="ROBOTS" CONTENT="NOINDEX" />
The page would still be crawled, but it won't be indexed. You can still submit the sitemaps through a Sitemap reference in robots.txt, and you don't have to make sure the sitemaps exclude the pages that are robotted out with a meta tag. The tag is supported by all the major search engines (as far as I remember, by Baidu as well).
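A minimal sketch of this approach, assuming your pages are rendered server-side: load the sitemap's URLs into a set once, then emit the NOINDEX tag on every page that isn't in it. The function names and the example sitemap URL are my own, not from the original answer.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def load_sitemap_urls(sitemap_xml: str) -> set[str]:
    """Collect the <loc> URLs from a sitemap's XML content."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)}

def robots_meta_tag(page_url: str, indexable: set[str]) -> str:
    """Return the meta tag to inject into the page's <head>:
    empty for sitemap-listed pages, NOINDEX for everything else."""
    if page_url in indexable:
        return ""
    return '<META NAME="ROBOTS" CONTENT="NOINDEX" />'
```

In practice you would call `load_sitemap_urls` once at startup (or cache it) and `robots_meta_tag` from your page template, rather than re-parsing the sitemap per request.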
OTHER TIPS
You will have to add an Allow entry for each element in the sitemap. This is cumbersome, but it's easy to do programmatically with something that reads in the sitemap, or, if the sitemap is itself created programmatically, to base the robots.txt generation on the same code.
Note that Allow is an extension to the robots.txt protocol and is not supported by all search engines, though it is supported by Google.
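A short sketch of that programmatic approach, assuming a standard sitemaps.org-format sitemap; the function name is my own. It reads the sitemap XML and emits a robots.txt with one Allow line per URL, followed by a catch-all Disallow. Google resolves conflicts by the most specific (longest) matching rule, so the Allow entries win over Disallow: / for Googlebot, but as noted above this behavior isn't guaranteed for every crawler.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Standard namespace used by sitemap files (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def robots_from_sitemap(sitemap_xml: str) -> str:
    """Build a robots.txt that allows only the URLs listed in the
    sitemap and disallows everything else."""
    root = ET.fromstring(sitemap_xml)
    # robots.txt rules match on the path, not the full URL.
    paths = [
        urlparse(loc.text.strip()).path
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
    ]
    lines = ["User-agent: *"]
    lines += [f"Allow: {path}" for path in paths]
    lines.append("Disallow: /")  # anything not explicitly allowed
    return "\n".join(lines) + "\n"
```

You would regenerate and redeploy the file whenever the sitemap changes, which is why driving both from the same code is attractive.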
By signing in to http://www.google.com/webmasters/ you can submit sitemaps directly to Google's search engine.