What is the known or expected impact of using a PHP/QueryPath crawler on a target web server, and how can it be kept to a minimum?

StackOverflow https://stackoverflow.com/questions/9901021

Question

I'm building a PHP + QueryPath crawler to prototype an idea. I'm worried that running it might affect the target site in some way, since it has a large number of relevant pages I want to scrape -- 1361 pages at the moment.

What are the recommendations for keeping the impact on the target site to a minimum?


Solution

Since you are building a crawler, the main impact you can have on the target website is consuming its bandwidth and server resources.

To keep the impact to a minimum, you can do the following:
1. While building your crawler, download a sample page from the target site to your computer and test your script against that local copy.
2. Ensure that the loop which scrapes the 1361 pages works correctly and downloads each page only once.
3. Ensure that your script downloads only one page at a time, and optionally add an interval between fetches so the target server sees less load (a minimal sketch follows this list).
4. Depending on how heavy each page is, you can spread the download of all 1361 pages over hours, days, or even months.
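For illustration, here is a minimal sketch combining points 2 and 3: one request at a time, a pause between fetches, and a visited set so no page is downloaded twice. The variable names, the one-second delay, and the Composer autoload path are assumptions for this sketch; only qp() comes from QueryPath itself.

<?php
// Minimal sketch (hypothetical names): fetch one page at a time,
// skip URLs already seen, and pause between requests.
require 'vendor/autoload.php'; // assumes QueryPath was installed via Composer

$urls = [/* ... the 1361 page URLs ... */];
$delaySeconds = 1;  // placeholder value; tune it for the target site
$visited = [];      // guards against downloading the same page twice

foreach ($urls as $url) {
    if (isset($visited[$url])) {
        continue;           // already fetched once
    }
    $visited[$url] = true;

    $qp = qp($url);         // one sequential request, nothing in parallel
    // ... extract what you need from $qp here ...

    sleep($delaySeconds);   // give the target server breathing room
}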

Other tips

QueryPath itself will issue vanilla HTTP requests -- nothing fancy at all. 1361 is not necessarily a large number, either.

I would suggest running your crawl in a loop, grabbing some number of pages (say, ten) in a row, sleeping for several seconds, and then grabbing another ten. Assuming $urls is an array of URLs, you could try something like this:

$count = count($urls);
$interval = 10; // After every ten requests...
$wait = 2;      // ...sleep for two seconds.
for ($i = 0; $i < $count; ++$i) {
  // Do whatever you're going to do with QueryPath.
  $qp = qp($urls[$i]);
  if ($i > 0 && $i % $interval == 0) {
    sleep($wait);
  }
}

As the previous poster suggests, test with a smaller number of URLs, then go up from there.

Here are a few other tips:

  • The robots.txt file of the remote site sometimes states how long a crawler should wait between requests (the Crawl-delay directive). If it is set, it is a good indicator of what your $wait variable should be (a sketch for reading it follows this list).
  • Hitting the site off-peak (e.g. at 2 AM in the site's local time) minimizes the chance of flooding the remote server with requests while real users are browsing.
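As a rough illustration, the directive can be read with a few lines of PHP. Note that Crawl-delay is a de facto convention rather than part of the original robots.txt standard, so many sites omit it; the URL and the fallback value below are placeholders, and a fuller parser would also match the directive to the relevant User-agent group rather than taking the first one it sees.

// Sketch: read Crawl-delay from robots.txt, falling back to a default.
// The URL and the 2-second fallback are placeholders for illustration.
$robots = @file_get_contents('https://example.com/robots.txt');
$wait = 2; // fallback delay in seconds
if ($robots !== false
    && preg_match('/^crawl-delay:\s*(\d+)/mi', $robots, $matches)) {
  $wait = (int) $matches[1];
}
// $wait can now replace the hard-coded value in the loop above.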
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow