Question

I crawl some data from the web because there is no API. Unfortunately, it's quite a lot of data from several different sites, and I quickly learned that I can't just make thousands of requests to the same site in a short time... I want to fetch the data as fast as possible, but I don't want to cause a DoS attack :)

The problem is, every server has different capabilities and I don't know them in advance. The sites belong to my clients, so my intention is to prevent any possible downtime caused by my script. So no policy like "I'll try a million requests first, and if that fails, I'll try half a million, and if that fails..." :)

Is there any best practice for this? How does Google's crawler know how many requests it can make to the same site in the same time window? Maybe they "shuffle their playlist", so there are not that many concurrent requests to a single site. Could I detect this stuff somehow via HTTP? Make a single request, measure the response time, roughly estimate how well the server copes, and then somehow work out a maximum number of concurrent requests?

I use a Python script, but this doesn't matter much for the answer -- it's just to let you know in which language I'd prefer your potential code snippets.

Solution

The Google spider is pretty damn smart. On my small site it hits one page per minute, almost to the second. They obviously have a page queue that is filled with both timing and sites in mind. I also wonder whether they are smart enough to avoid hitting multiple domains hosted on the same server -- so some recognition of IP ranges as well as URLs.
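If you want that same caution, one cheap approximation is to resolve each domain and throttle per resolved IP rather than per domain, so that client sites sharing one server also share one request budget. A rough Python sketch using only the standard library -- and only an approximation, since CDNs and load balancers will confuse it:

    import socket
    from collections import defaultdict
    from urllib.parse import urlparse

    def group_domains_by_ip(urls):
        """Map each resolved server IP to the set of domains hosted on it."""
        hosts_by_ip = defaultdict(set)
        for url in urls:
            domain = urlparse(url).hostname or ""
            try:
                ip = socket.gethostbyname(domain)
            except socket.gaierror:
                ip = domain                     # unresolvable: treat it as its own "server"
            hosts_by_ip[ip].add(domain)
        return hosts_by_ip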

Separating the job of queueing up the URLs to be spidered at a specific time from the actual spidering job would be a good architecture for any spider. All of your spiders could use the urlToSpiderService.getNextUrl() method, which would block (if necessary) until the next URL is due to be spidered.
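A minimal single-threaded Python sketch of that split, with get_next_url() standing in for getNextUrl(); the class name, the one-minute starting delay, and the heap-based bookkeeping are illustrative choices, not a prescription:

    import heapq
    import time
    from collections import deque
    from urllib.parse import urlparse

    class UrlToSpiderService:
        """Hands out URLs one at a time, never before a domain's next allowed slot."""

        def __init__(self, default_delay=60.0):
            self.default_delay = default_delay   # start slow: one page per minute per domain
            self.queues = {}                     # domain -> deque of pending URLs
            self.ready_at = []                   # heap of (next_allowed_time, domain)

        def add_url(self, url):
            domain = urlparse(url).hostname or ""
            if domain not in self.queues:
                self.queues[domain] = deque()
                heapq.heappush(self.ready_at, (time.time(), domain))
            self.queues[domain].append(url)

        def get_next_url(self):
            """Block until some domain may be hit again, then return one of its URLs."""
            while self.ready_at:
                when, domain = heapq.heappop(self.ready_at)
                time.sleep(max(0.0, when - time.time()))
                queue = self.queues[domain]
                if not queue:
                    del self.queues[domain]      # re-registered by the next add_url()
                    continue
                url = queue.popleft()
                # book this domain's next slot; a smarter per-domain delay can go here
                heapq.heappush(self.ready_at, (time.time() + self.default_delay, domain))
                return url
            return None                          # nothing queued at all

Spider workers then just loop on get_next_url() and fetch whatever comes back; the fixed default_delay is the piece that the queue-size-based calculation below would replace.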

I believe that Google looks at the number of pages on a site to determine the spider speed. The more pages you have to refresh in a given time, the faster they need to hit that particular server. You certainly should be able to use that as a metric, although before you've done an initial crawl it would be hard to determine.

You could start out at one page every minute and then, as the number of pages to be spidered for a particular site increases, decrease the delay. Some sort of function like the following would be needed:

    def delay_between_pages(domain, queue_sizes, refresh_period=3600.0, minimum=1.0):
        # queue_sizes maps each domain to the number of pages in its to-do queue;
        # refresh_period is the overall window (seconds) you want to complete in
        pages_to_do = max(queue_sizes.get(domain, 0), 1)
        delay = refresh_period / pages_to_do
        # if more than a minute, just return a minute;
        # if less than some minimum, just return the minimum
        return min(max(delay, minimum), 60.0)
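For example, with the (assumed) one-hour refresh period and a couple of made-up client domains, a short queue stays at the one-minute cap while a long queue drops to the one-second floor:

    queue_sizes = {"client-a.example": 10, "client-b.example": 10_000}
    delay_between_pages("client-a.example", queue_sizes)   # -> 60.0 seconds (capped)
    delay_between_pages("client-b.example", queue_sizes)   # -> 1.0 second (floored)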

Could I detect this stuff somehow via HTTP?

With the modern internet, I don't see how you can. Certainly, if the server is taking a couple of seconds to respond, or is returning 500 errors, then you should throttle way back. But a typical connection and download is sub-second these days for a large percentage of servers, and I'm not sure there is much to be learned from any stats in that area.
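Those two signals -- slow responses and 500-level errors -- are still cheap to react to defensively. A rough sketch, assuming the third-party requests library; the two-second threshold and the doubling/decay factors are arbitrary starting points, not measured values:

    import time
    import requests

    def fetch_politely(url, delay, slow_threshold=2.0, timeout=30):
        """Fetch one URL; return (response_or_None, adjusted_delay_in_seconds)."""
        start = time.monotonic()
        try:
            response = requests.get(url, timeout=timeout)
        except requests.RequestException:
            return None, delay * 2              # connection trouble: back off hard
        elapsed = time.monotonic() - start

        if response.status_code >= 500 or elapsed > slow_threshold:
            delay *= 2                          # struggling server: throttle way back
        else:
            delay = max(delay * 0.9, 1.0)       # fast, healthy response: creep back up
        return response, delay

The caller keeps the returned delay per domain and sleeps for it before the next request to that domain.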

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow