Question

I've been thinking a while about disallowing every crawler except Ask, Google, Microsoft, and Yahoo! from my site.

The reasoning behind this is that I've never seen any traffic being generated by any of the other web-crawlers out there.

My questions are:

  1. Is there any reason not to?
  2. Has anybody done this?
  3. Did you notice any negative effects?

Update:
Up till now I have used the blacklist approach: if I do not like a crawler, I add it to the disallow list.
I'm no fan of blacklisting, however, as it is a never-ending story: there are always more crawlers out there.

I'm not so much worried about the really ugly, misbehaving crawlers; they are detected and blocked automatically (and they typically do not ask for robots.txt anyhow :)

However, many crawlers are not really misbehaving in any way; they just do not seem to generate any value for me or my customers.
There are, for example, a couple of crawlers that power websites which claim they will be The Next Google; Only Better. I've never seen any traffic coming from them, and I'm quite sceptical about them becoming better than any of the four search engines mentioned above.

Update 2:
I've been analysing the traffic to several sites for some time now, and it seems that for reasonably small sites, with about 100 unique human visitors a day (= visitors that I cannot identify as non-human), about 52% of the generated traffic comes from automated processes.

60% of all automated visitors do not read robots.txt; 40% (21% of total traffic) do request robots.txt. (This includes Ask, Google, Microsoft, and Yahoo!)

So my thinking is: if I block all the well-behaved crawlers that do not seem to generate any value for me, I could reduce bandwidth use and server load by around 12% to 17%.
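The arithmetic behind these figures can be checked with a short sketch. The 52%, 40%, and 12% to 17% values come from the text above; the share left over for the four whitelisted engines is only an inference from those numbers, not something that was measured.

```python
# Figures quoted in the post.
automated_share = 52.0   # % of all traffic generated by automated processes
polite_fraction = 0.40   # fraction of automated visitors that request robots.txt

# Share of total traffic that comes from robots.txt-reading crawlers.
polite_share = round(automated_share * polite_fraction, 1)

# Estimated savings from blocking the polite crawlers that bring no value.
savings_low, savings_high = 12.0, 17.0

# Inferred (not stated in the post): what would remain for the four
# whitelisted search engines, as a percentage of total traffic.
kept_low = round(polite_share - savings_high, 1)
kept_high = round(polite_share - savings_low, 1)

if __name__ == "__main__":
    print(f"polite crawlers: {polite_share}% of total traffic")
    print(f"whitelisted engines (inferred): {kept_low}% to {kept_high}%")
```

Under these numbers the 21% quoted above is 20.8% rounded, and the savings estimate implies that the four big engines account for only a few percent of total traffic.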


Solution

The internet is a publishing mechanism. If you want to whitelist your site, you're going against the grain, but that's fine.

Do you want to whitelist your site?

Bear in mind that badly behaved bots, which ignore robots.txt, aren't affected anyway (obviously), and well-behaved bots are probably there for a good reason; it's just that the reason is opaque to you.
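For what it's worth, a whitelist in robots.txt could look roughly like the sketch below, checked here with Python's standard `urllib.robotparser`. The user-agent tokens (`Googlebot`, `bingbot`, `Slurp`, `Teoma`) are assumptions about what the four engines call their crawlers and should be verified against each engine's documentation; `ExampleBot` is a made-up name for "everyone else".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist. The tokens are assumptions, not verified values.
# An empty "Disallow:" means "nothing is disallowed" for that group.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: bingbot
User-agent: Slurp
User-agent: Teoma
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(agent: str, path: str = "/") -> bool:
    """Return True if the robots.txt above lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("Googlebot", "Teoma", "ExampleBot"):
        print(agent, is_allowed(agent, "/index.html"))
```

As noted above, this only affects crawlers that actually read and honour robots.txt.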

OTHER TIPS

Whilst other sites that crawl your site might not be sending any traffic your way, it's possible that they themselves are being indexed by Google et al. and so are adding to your PageRank. Blocking them from your site might affect this.

Is there any reason not to?

Do you want to be left out of something that could be including your site, that you have no knowledge of, and that is indirectly bringing a lot of traffic your way?

If some strange crawlers are hammering your site and eating your bandwidth you may want to, but it is quite possible that such crawlers wouldn’t honour your robots.txt either.

Examine your log files and see what crawlers you have and what proportion of your bandwidth they are eating. There may be more direct ways to block traffic which is bombarding your site.
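A minimal sketch of that kind of log analysis, assuming the access log is in the common Apache/nginx "combined" format (adjust the pattern if yours differs). The function names and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumes the "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def bandwidth_by_agent(lines):
    """Sum response bytes per user-agent string."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        size = match.group("size")
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return totals


def bandwidth_shares(lines):
    """Return (agent, percent of total bytes) pairs, largest first."""
    totals = bandwidth_by_agent(lines)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [(agent, 100.0 * n / grand_total) for agent, n in totals.most_common()]


# Made-up sample lines, for illustration only.
SAMPLE = [
    '203.0.113.5 - - [10/Oct/2008:13:55:36 +0000] "GET / HTTP/1.1" 200 3000 "-" "ExampleBot/1.0"',
    '203.0.113.9 - - [10/Oct/2008:13:56:01 +0000] "GET /a HTTP/1.1" 200 1000 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for agent, percent in bandwidth_shares(SAMPLE):
        print(f"{percent:5.1f}%  {agent}")
```

Against a real log you would pass an open file object instead of `SAMPLE`. Keep in mind that user-agent strings can be faked, so the result is only a rough guide.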

Whitelisting within robots.txt is a bit awkward, as the original standard has no “Allow” field (many major crawlers now honour one, but not all do). The easy way is to put all files to be disallowed into a separate directory, say “stuff”, and leave the one file you want crawled in the level above this directory.
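The directory layout described above might translate into a robots.txt like the one in this sketch, again checked with Python's standard `urllib.robotparser`. The paths `/stuff/` and `/index.html` and the crawler name `ExampleBot` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Everything to be hidden lives under /stuff/; the one public file sits
# one level above it, so a single Disallow line is enough.
DIRECTORY_ROBOTS_TXT = """\
User-agent: *
Disallow: /stuff/
"""


def can_crawl(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the robots.txt above lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(DIRECTORY_ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print(can_crawl("/index.html"))          # the file left above "stuff"
    print(can_crawl("/stuff/private.html"))  # a file inside "stuff"
```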

My only worry is that you may miss the next big thing.

There was a long period where AltaVista was the search engine, possibly even more so than Google is now (there was no Bing or Ask, and Yahoo was a directory rather than a search engine as such). Sites that blocked all but AltaVista back then would never have seen traffic from Google, and therefore would never have known how popular it was getting unless they heard about it from another source, which might have put them at a considerable disadvantage for a while.

PageRank tends to be biased towards older sites. You don't want to appear newer than you are because you were blocking access via robots.txt for no reason. These guys: http://www.dotnetdotcom.org/ may be completely useless now, but maybe in 5 years' time the fact that you weren't in their index now will count against you in the next big search engine.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow