Question

How to disallow all dynamic URLs in robots.txt

Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

I want to disallow everything that starts with /?q=


Solution

The answer to your question is to use

Disallow: /?q=

The best (currently accessible) source on robots.txt I could find is the Wikipedia article. (The supposedly definitive source is http://www.robotstxt.org, but the site is down at the moment.)

According to the Wikipedia page, the standard defines just two fields: User-agent: and Disallow:. The Disallow: field does not allow explicit wildcards; instead, each "disallowed" path is a path prefix, i.e. it matches any path that starts with the specified value.
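
If you want to sanity-check that prefix behaviour, one option is Python's standard-library urllib.robotparser, which uses the same simple prefix matching; here is a minimal sketch (example.com and the sample URLs are placeholders, and real crawlers may of course behave differently):

from urllib.robotparser import RobotFileParser

# The complete file: one prefix rule replaces the long per-path list.
rules = """\
User-agent: *
Disallow: /?q=
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Anything whose path-plus-query string starts with /?q= matches the prefix...
print(rp.can_fetch("*", "https://example.com/?q=admin/"))      # False (blocked)
print(rp.can_fetch("*", "https://example.com/?q=user/login/")) # False (blocked)
# ...while other URLs are unaffected.
print(rp.can_fetch("*", "https://example.com/about"))          # True (allowed)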

The Allow: field is a non-standard extension, and any support for explicit wildcards in Disallow would be a non-standard extension. If you use these, you have no right to expect that a (legitimate) web crawler will understand them.

This is not a matter of crawlers being "smart" or "dumb": it is all about standards compliance and interoperability. For example, any web crawler that did "smart" things with explicit wildcard characters in a "Disallow:" would be bad for (hypothetical) robots.txt files where those characters were intended to be interpreted literally.

OTHER TIPS

As Paul said, a lot of robots.txt interpreters are not too bright and might not interpret wildcards in the path the way you intend.

That said, some crawlers try to skip dynamic pages on their own, worrying they might get caught in infinite loops on links with varying URLs. I am assuming you are asking this question because you are facing a persistent crawler that is trying hard to reach those dynamic paths.

If you have issues with a specific crawler, you can investigate how that crawler handles robots.txt (look up its documented robots.txt support) and add a crawler-specific section for it, along the lines of the sketch below.
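
As a rough illustration of such a crawler-specific section, here is a sketch checked with Python's urllib.robotparser; "GreedyBot" is a made-up name standing in for whatever crawler is misbehaving:

from urllib.robotparser import RobotFileParser

# "GreedyBot" is hypothetical; substitute the real crawler's User-agent token.
rules = """\
User-agent: GreedyBot
Disallow: /

User-agent: *
Disallow: /?q=
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The crawler-specific group applies only to GreedyBot, shutting it out entirely...
print(rp.can_fetch("GreedyBot", "https://example.com/about"))     # False
# ...while every other agent just loses the dynamic /?q= URLs.
print(rp.can_fetch("SomeOtherBot", "https://example.com/about"))  # True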

If you just want to disallow access to your dynamic pages in general, you might want to rethink your robots.txt design.

More often than not, the "pages" that handle dynamic parameters live under a specific directory or a specific set of directories, which is why it is normally enough to just Disallow: /cgi-bin or /app and be done with it.

In your case you seem to have mapped the root to an area that handles parameters. You might want to reverse the logic of robots.txt and say something like:

User-agent: * 
Allow: /index.html
Allow: /offices
Allow: /static 
Disallow: /

This way the Allow list overrides the Disallow list by spelling out exactly what crawlers should index. Note that not all crawlers are created equal, and you may want to refine that robots.txt later, adding a specific section for any crawler that still misbehaves.
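
As a rough check of how a compliant parser reads that file, Python's urllib.robotparser (used in the sketches above) does honour the non-standard Allow: lines; the example.com URLs are again placeholders:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /index.html
Allow: /offices
Allow: /static
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/index.html"))    # True: explicitly allowed
print(rp.can_fetch("*", "https://example.com/offices/rome"))  # True: /offices prefix match
print(rp.can_fetch("*", "https://example.com/?q=node/add/"))  # False: falls through to Disallow: /

Keep in mind that conflict resolution between Allow and Disallow differs between implementations: urllib.robotparser takes the first matching rule, while some crawlers prefer the most specific (longest) path. For this particular file both interpretations agree, since every Allow path is longer than the bare /.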

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow