(1) Can each user agent have its own crawl-delay?
Yes. Each record, started by one or more `User-agent` lines, can have its own `Crawl-delay` line. Note that `Crawl-delay` is not part of the original robots.txt specification, but it’s no problem to include it for parsers that understand it, as the spec defines:

> Unrecognised headers are ignored.

So older robots.txt parsers will simply ignore your `Crawl-delay` lines.
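For example, a minimal sketch (the bot names are made up) where each record gets its own delay:

```
User-agent: examplebot-fast
Crawl-delay: 5
Disallow: /private/

User-agent: examplebot-slow
Crawl-delay: 120
Disallow: /private/
```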
(2) Where do you put the crawl-delay line for each user agent, before or after the Allow / Disallow lines?
Doesn’t matter.
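For example, these two records (again with made-up bot names) are equivalent as far as placement goes:

```
User-agent: examplebot-a
Crawl-delay: 10
Disallow: /private/

User-agent: examplebot-b
Disallow: /private/
Crawl-delay: 10
```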
(3) Does there have to be a blank line between each user agent group?
Yes. Records have to be separated by one or more blank lines (as in the examples above). See the original spec:

> The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).
(4) If I want to set all of the user agents to have a crawl delay of 10 seconds, would the following be correct?
No. Bots look for the record that matches their user-agent. Only if they don’t find one will they fall back to the `User-agent: *` record. So in your example all the listed bots (like `Googlebot`, `MSNBot`, `Yahoo! Slurp`, etc.) will have no `Crawl-delay`.
Also note that you can’t have several records with `User-agent: *`:

> If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
So parsers might look (if no other record matched) for the first record with `User-agent: *` and ignore the following ones. For your first example that would mean that URLs beginning with `/ads/`, `/cgi-bin/`, and `/scripts/` are not blocked.
And even if you have only one record with `User-agent: *`, those `Disallow` lines apply only to bots that have no other matching record! As your comment `# Block Directories for all spiders` suggests, you want these URL paths to be blocked for all spiders, so you’d have to repeat the `Disallow` lines for every record.
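A sketch of what that could look like, assuming the directories from your example and just two of the named bots (the same pattern would repeat for every bot you list):

```
User-agent: Googlebot
Crawl-delay: 10
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: MSNBot
Crawl-delay: 10
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

User-agent: *
Crawl-delay: 10
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/
```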