Question

We can tell bots to crawl or not to crawl our website in robots.txt. On the other hand, we can control the crawl rate in Google Webmasters (how often Googlebot crawls the website). I wonder if it is possible to limit crawler activity through robots.txt.

I mean allowing bots to crawl pages, but limiting their presence by time, number of pages, or size.


Solution

Not that I have found. Robots.txt is a place to list directories or files you would like bots to include or exclude. If there is a way to do this, it is not standard yet. Remember that whoever creates a bot chooses whether or not to respect robots.txt; not all bots ("bad bots") respect this file.

Currently, if there were settings to reduce crawl speed, time on site, etc., they would be on a bot-by-bot basis and not standardized into robots.txt values.
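For illustration, here is a minimal robots.txt sketch (the paths are placeholders, not from the question) showing the kind of include/exclude rules the file is actually designed for; note there is no field for speed or time limits:

# keep all bots out of two directories
User-agent: *
Disallow: /private/
Disallow: /tmp/

# an empty Disallow means this bot may crawl everything
User-agent: Googlebot
Disallow: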

More info: http://www.robotstxt.org/robotstxt.html

OTHER TIPS

There is one directive you can use in robots.txt: "Crawl-delay".

Crawl-delay: 5

This means robots should crawl no more than one page every 5 seconds. But as far as I know, this directive is not officially part of the robots.txt standard.
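As a sketch of how it is usually written (the delay values here are just examples), Crawl-delay sits under a User-agent group like any other rule. Support varies: Bingbot honors it, while Googlebot ignores it and relies on the Webmaster Tools setting instead.

# ask compliant bots to wait about 5 seconds between requests
User-agent: *
Crawl-delay: 5

# a stricter limit for one specific bot
User-agent: Bingbot
Crawl-delay: 10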

Also, some robots don't take the robots.txt file into account at all. So even if you have disallowed access to some pages, they may still get crawled by some robots; of course, not by the largest ones like Google.

Baidu, for example, may ignore robots.txt, but that's not certain.

I've got no official source for this info, so you can just Google it.

I know this is a really old question, but I wanted to add that, according to the Google documentation, this is the official answer:

You can generally adjust the crawl rate setting in your Google Webmaster Tools account.

per: https://developers.google.com/webmasters/control-crawl-index/docs/faq#h04

From within Webmaster Tools you can follow these steps:

  1. On the Search Console Home page, click the site that you want.

  2. Click the gear icon, then click Site Settings.

  3. In the Crawl rate section, select the option you want and then limit the crawl rate as desired.

The new crawl rate will be valid for 90 days.

ref: google support question

No, the robots.txt file can only specify which pages you don't want to be crawled and which user agents those rules apply to. You can't do anything else with the file.

Some websites use the Allow and Sitemap directives, but they do not appear to be valid directives according to the official website, even though some crawlers may respect them.
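For completeness, a sketch of those two extensions (the URL and paths are placeholders); major crawlers such as Googlebot and Bingbot do honor Allow and Sitemap even though they are not part of the original specification:

User-agent: *
Disallow: /admin/
# Allow carves an exception out of the broader Disallow, for crawlers that support it
Allow: /admin/public/

# points crawlers at the XML sitemap
Sitemap: https://www.example.com/sitemap.xml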

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow