Question

I have written some rules to block a few URLs in robots.txt. Now I want to verify those rules. Is there any tool for verifying robots.txt?

I have written this rule:

Disallow: /classifieds/search*/    

to block these URLs:

http://example.com/classifieds/search?filter_states=4&filter_frieght=8&filter_driver=2
http://example.com/classifieds/search?keywords=Covenant+Transport&type=Carrier
http://example.com/classifieds/search/
http://example.com/classifieds/search

I also want to know what is the difference between these rules

Disallow: /classifieds/search*/
Disallow: /classifieds/search/
Disallow: /classifieds/search

Solution

Your rule Disallow: /classifieds/search*/ does not do what you want it to do.

First, note that the * character has no special meaning in the original robots.txt specification. But some parsers, like Google’s, use it as a wildcard for pattern matching. Assuming that you have this rule for those parsers only:

From your example, this rule would only block http://example.com/classifieds/search/. The other three URLs have no / after search (the query strings contain none, and the last URL ends right after search).


  • Disallow: /classifieds/search
    → blocks all URLs whose paths start with /classifieds/search

  • Disallow: /classifieds/search/
    → blocks all URLs whose paths start with /classifieds/search/

  • Disallow: /classifieds/search*/
    → for parsers following the original spec: blocks all URLs whose paths start with /classifieds/search*/
    → for parsers that use * as wildcard: blocks all URLs whose paths start with /classifieds/search, followed by anything, followed by /
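
To make the difference concrete, here is a small Python sketch of the two interpretations applied to the three rules and the four example URLs. This is my own illustration, not any crawler's actual code: the helper names are made up, and the $ end anchor some parsers support is not handled.

```python
import re
from urllib.parse import urlsplit

def target_of(url):
    # Rules are matched against the path plus the query string, not the host.
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")

def matches_original_spec(rule, url):
    # Original spec: the rule is a plain prefix; '*' is just a literal character.
    return target_of(url).startswith(rule)

def matches_with_wildcard(rule, url):
    # Wildcard-style matching: '*' matches any sequence of characters,
    # and the rule is anchored at the start of the path.
    pattern = "^" + ".*".join(re.escape(part) for part in rule.split("*"))
    return re.search(pattern, target_of(url)) is not None

urls = [
    "http://example.com/classifieds/search?filter_states=4&filter_frieght=8&filter_driver=2",
    "http://example.com/classifieds/search?keywords=Covenant+Transport&type=Carrier",
    "http://example.com/classifieds/search/",
    "http://example.com/classifieds/search",
]

for rule in ["/classifieds/search*/", "/classifieds/search/", "/classifieds/search"]:
    print(rule)
    for url in urls:
        print("  original: %-5s  wildcard: %-5s  %s"
              % (matches_original_spec(rule, url), matches_with_wildcard(rule, url), url))
```

Running this shows that only Disallow: /classifieds/search matches all four URLs under either interpretation.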


For blocking the four example URLs, simply use the following:

User-agent: *
Disallow: /classifieds/search

This will block, for example:

  • http://example.com/classifieds/search?filter=4
  • http://example.com/classifieds/search/
  • http://example.com/classifieds/search/foo
  • http://example.com/classifieds/search
  • http://example.com/classifieds/search.html
  • http://example.com/classifieds/searching
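
If you want to check this programmatically rather than by eye, here is a minimal sketch using Python's standard-library urllib.robotparser. As far as I know, that parser follows the original specification (plain prefix matching, no * wildcard support), which is exactly what the recommended rule relies on.

```python
from urllib.robotparser import RobotFileParser

# Feed the proposed rules straight into the parser instead of fetching them.
robots_txt = """\
User-agent: *
Disallow: /classifieds/search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in [
    "http://example.com/classifieds/search?filter_states=4",
    "http://example.com/classifieds/search/",
    "http://example.com/classifieds/search/foo",
    "http://example.com/classifieds/search",
    "http://example.com/classifieds/searching",
    "http://example.com/classifieds/other",
]:
    allowed = parser.can_fetch("*", url)
    print("allowed" if allowed else "blocked", url)
```

Everything under the /classifieds/search prefix should print as blocked, while /classifieds/other stays allowed.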

OTHER TIPS

The problem with using robots.txt is that it cannot block anything per se; it only asks web crawlers nicely not to crawl certain areas of your site.

As for verification: provided the syntax is valid, it should work, and you can monitor your server logs to see whether known compliant bots avoid those directories after reading robots.txt. This, of course, relies on the bots accessing your site complying with the standard.
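
As a rough illustration of that log check, here is a short Python sketch. It assumes an access log in the common Apache/Nginx format at a hypothetical path, and it only looks for Googlebot; adjust both to your setup.

```python
# Hypothetical log path and a single bot name; adjust both to your setup.
blocked_prefix = "/classifieds/search"

with open("/var/log/nginx/access.log") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        # In the common log format the request line is the first quoted field,
        # e.g. "GET /classifieds/search/ HTTP/1.1".
        try:
            request = line.split('"')[1]
            path = request.split()[1]
        except IndexError:
            continue
        if path.startswith(blocked_prefix):
            print("Still crawled despite robots.txt:", path)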

There are plenty of online validators you can use, such as http://www.frobee.com/robots-txt-check.

And when it comes to those three rules:

> **Disallow: /classifieds/search*/** Disallow anything inside a directory where the name starts with search, but not the directory itself

> **Disallow: /classifieds/search/** Disallow anything inside the directory named search

> **Disallow: /classifieds/search** Disallow any directory starting with search

I haven't tested this myself, but did you try a robots.txt checker? As for the difference between the three rules, I'd say that

  • Disallow: /classifieds/search*/ disallows all subdirectories of /classifieds/ beginning with "search"
  • Disallow: /classifieds/search/ only disallows the /classifieds/search/ directory
  • Disallow: /classifieds/search disallows visiting a file called /classifieds/search
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow