Question

I have written some rules to block a few URLs in robots.txt. Now I want to verify those rules. Is there any tool for verifying robots.txt?

I have written this rule:

Disallow: /classifieds/search*/    

to block these URLs:

http://example.com/classifieds/search?filter_states=4&filter_frieght=8&filter_driver=2
http://example.com/classifieds/search?keywords=Covenant+Transport&type=Carrier
http://example.com/classifieds/search/
http://example.com/classifieds/search

I also want to know what is the difference between these rules

Disallow: /classifieds/search*/
Disallow: /classifieds/search/
Disallow: /classifieds/search

Solution

Your rule Disallow: /classifieds/search*/ does not do what you want it to do.

First, note that the * character has no special meaning in the original robots.txt specification. But some parsers, like Google’s, use it as a wildcard for pattern matching. Assuming that you have this rule for those parsers only:

From your example, this rule would only block http://example.com/classifieds/search/. The other three URLs have no / after search (the query strings contain none, and the last URL ends right after search).


  • Disallow: /classifieds/search
    → blocks all URLs whose paths start with /classifieds/search

  • Disallow: /classifieds/search/
    → blocks all URLs whose paths start with /classifieds/search/

  • Disallow: /classifieds/search*/
    → for parsers following the original spec: blocks all URLs whose paths start with /classifieds/search*/
    → for parsers that use * as wildcard: blocks all URLs whose paths start with /classifieds/search, followed by anything, followed by /
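
To make the difference concrete, here is a small Python sketch of the two interpretations applied to the three rules and the four example URLs. This is my own illustration, not any crawler's actual code: the helper names are made up, and the $ end anchor some parsers support is not handled.

```python
import re
from urllib.parse import urlsplit

def target_of(url):
    # Rules are matched against the path plus the query string, not the host.
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")

def matches_original_spec(rule, url):
    # Original spec: the rule is a plain prefix; '*' is just a literal character.
    return target_of(url).startswith(rule)

def matches_with_wildcard(rule, url):
    # Wildcard-style matching: '*' matches any sequence of characters,
    # and the rule is anchored at the start of the path.
    pattern = "^" + ".*".join(re.escape(part) for part in rule.split("*"))
    return re.search(pattern, target_of(url)) is not None

urls = [
    "http://example.com/classifieds/search?filter_states=4&filter_frieght=8&filter_driver=2",
    "http://example.com/classifieds/search?keywords=Covenant+Transport&type=Carrier",
    "http://example.com/classifieds/search/",
    "http://example.com/classifieds/search",
]

for rule in ["/classifieds/search*/", "/classifieds/search/", "/classifieds/search"]:
    print(rule)
    for url in urls:
        print("  original: %-5s  wildcard: %-5s  %s"
              % (matches_original_spec(rule, url), matches_with_wildcard(rule, url), url))
```

Running this shows that only Disallow: /classifieds/search matches all four URLs under either interpretation.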


For blocking the four example URLs, simply use the following:

User-agent: *
Disallow: /classifieds/search

This will block, for example:

  • http://example.com/classifieds/search?filter=4
  • http://example.com/classifieds/search/
  • http://example.com/classifieds/search/foo
  • http://example.com/classifieds/search
  • http://example.com/classifieds/search.html
  • http://example.com/classifieds/searching
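
If you want to check this programmatically rather than by eye, here is a minimal sketch using Python's standard-library urllib.robotparser. As far as I know, that parser follows the original specification (plain prefix matching, no * wildcard support), which is exactly what the recommended rule relies on.

```python
from urllib.robotparser import RobotFileParser

# Feed the proposed rules straight into the parser instead of fetching them.
robots_txt = """\
User-agent: *
Disallow: /classifieds/search
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in [
    "http://example.com/classifieds/search?filter_states=4",
    "http://example.com/classifieds/search/",
    "http://example.com/classifieds/search/foo",
    "http://example.com/classifieds/search",
    "http://example.com/classifieds/searching",
    "http://example.com/classifieds/other",
]:
    allowed = parser.can_fetch("*", url)
    print("allowed" if allowed else "blocked", url)
```

Everything under the /classifieds/search prefix should print as blocked, while /classifieds/other stays allowed.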

OTHER TIPS

The problem with using robots.txt is that it cannot block anything per se; it only asks web crawlers nicely not to crawl certain areas of your site.

As for verification: provided the syntax is valid, it should work, and you can monitor your server logs to see whether known compliant bots avoid those directories after reading robots.txt. This, of course, relies on the bots accessing your site complying with the standard.
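
As a rough illustration of that log check, here is a short Python sketch. It assumes an access log in the common Apache/Nginx format at a hypothetical path, and it only looks for Googlebot; adjust both to your setup.

```python
# Hypothetical log path and a single bot name; adjust both to your setup.
blocked_prefix = "/classifieds/search"

with open("/var/log/nginx/access.log") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        # In the common log format the request line is the first quoted field,
        # e.g. "GET /classifieds/search/ HTTP/1.1".
        try:
            request = line.split('"')[1]
            path = request.split()[1]
        except IndexError:
            continue
        if path.startswith(blocked_prefix):
            print("Still crawled despite robots.txt:", path)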

There are plenty of online validators you can use, such as http://www.frobee.com/robots-txt-check.

And when it comes to those three rules:

> **Disallow: /classifieds/search*/** Disallow anything inside a directory where the name starts with search, but not the directory itself

> **Disallow: /classifieds/search/** Disallow anything inside the directory named search

> **Disallow: /classifieds/search** Disallow any directory starting with search

I haven't tested this myself, but did you try a robots.txt checker? As for the difference between the three rules, I'd say that

  • Disallow: /classifieds/search*/ disallows all subdirectories of /classifieds/ beginning with "search"
  • Disallow: /classifieds/search/ only disallows the /classifieds/search/ directory
  • Disallow: /classifieds/search disallows visiting a file called /classifieds/search
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow