Question

Let's assume we are using pretty URLs with mod_rewrite or something similar and have the following two routes:

  • /page
  • /page-two

Now we want to disallow only the first route (/page) from being crawled by robots.

# robots.txt
User-agent: *
Disallow: /page

From the Disallow section of the original robots.txt specification (http://www.robotstxt.org/orig.html):

... For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

So the above robots.txt example is disallowing /page-two too, correct?

What is the correct way to get this done?

Would the following code do it?

# robots.txt
User-agent: *
Disallow: /page/

Solution

From Google's robots.txt specifications:

At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined.

This means that it doesn't matter in what order you define them. In your case this should work:

User-agent: *
Disallow: /page
Allow: /page-

To make it clearer: every URL is matched against all paths. /page will match /page/123, /page/subdirectory/123/whateverishere.html, /page-123, and /page. The directive with the longest matching path wins. If both /page and /page- match, the directive for /page- is used (Allow). If /page matches but /page- does not, the directive for /page is used (Disallow). If neither /page nor /page- matches, the default applies (Allow).
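To see this in action, here is a minimal Python sketch of the longest-match logic described above. It is an illustration only, not Google's actual parser: it ignores wildcards, user-agent groups, and percent-encoding, and the rule list is just this question's example.

# longest_match.py -- toy model of Google's robots.txt precedence
RULES = [
    ("Disallow", "/page"),
    ("Allow", "/page-"),
]

def is_allowed(url_path):
    # Collect every rule whose path is a prefix of the URL path.
    matches = [(kind, path) for kind, path in RULES
               if url_path.startswith(path)]
    if not matches:
        return True  # no rule matches: allowed by default
    # The most specific (longest) matching path wins,
    # regardless of the order the rules appear in.
    kind, _ = max(matches, key=lambda rule: len(rule[1]))
    return kind == "Allow"

for path in ["/page", "/page/123", "/page-two", "/page-123", "/other"]:
    print(path, "->", "allowed" if is_allowed(path) else "disallowed")

Running it prints "disallowed" for /page and /page/123, and "allowed" for /page-two, /page-123, and /other, matching the walkthrough above.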

OTHER TIPS

User-agent: *
Allow: /page-two
Disallow: /page

So the above robots.txt example is disallowing /page-two too, correct?

Correct.

What is the correct way to get this done?

In the original robots.txt specification, this is not possible at all.

(Note that your last example does not block /page: with Disallow: /page/, the disallowed URLs would have to start with /page/, including the trailing slash.)

Some parsers understand Allow and/or wildcards, which can be used to solve your problem, but neither is part of the original specification. If you only have certain bots in mind, check their documentation to see what kind of "extensions" to robots.txt they support.
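For example, parsers that honor Google's wildcard extensions also understand $ as an end-of-URL anchor. Assuming /page has no sub-paths of its own, a rule like the following would block exactly /page for those bots while leaving /page-two alone:

# robots.txt ($ wildcard requires a supporting parser, e.g. Googlebot;
# not part of the original specification)
User-agent: *
Disallow: /page$

Note that /page$ matches only the exact URL /page; /page?query=... and /page/anything would still be crawlable.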

Alternatives:

  • Use the HTTP header X-Robots-Tag (see the example after this list).
  • Use the meta element with the robots name (example below; but note: noindex is about indexing, while robots.txt’s Disallow is about crawling).
  • Change the URL design of your site.
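For reference, the first two alternatives look like this, using /page from the question. Again, noindex keeps the page out of the index but does not stop crawling:

# HTTP response header sent with /page
X-Robots-Tag: noindex

<!-- or, inside the <head> of the /page document -->
<meta name="robots" content="noindex">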
Licensed under: CC-BY-SA with attribution