Why does my PHP QueryPath 2.1.2 scraping script on WAMP return only 5 articles instead of 43? Timeout?

StackOverflow https://stackoverflow.com/questions/15012307

  •  10-03-2022

Question

I am trying to scrape 43 blog posts from my blog and store them in an array, but when I print_r the array it only returns the first 5 (the rest are empty) instead of all 43. Why? And how can I get all 43? I run this script from cmd.exe (the command line) on WAMP.

<?php

require 'src/QueryPath/QueryPath.php';


$qp1 = htmlqp('http://myblog.com/blog');
$qp2 = htmlqp('http://myblog.com/blog/Page-2.html');
$qp3 = htmlqp('http://myblog.com/blog/Page-3.html');
$qp4 = htmlqp('http://myblog.com/blog/Page-4.html');

foreach ($qp1->find('ol>li a[href],.jbReadon') as $item) {
    $links[] = $item->attr('href');
}

foreach ($qp2->find('ol>li a[href],.jbReadon') as $item) {
    $links[] = $item->attr('href');
}

foreach ($qp3->find('ol>li a[href],.jbReadon') as $item) {
    $links[] = $item->attr('href');
}

foreach ($qp4->find('ol>li a[href],.jbReadon') as $item) {
    $links[] = $item->attr('href');
}


print_r($links);



foreach ($links as $link) {
    $url = "http://myblog.com".$link;

    $content[] = htmlqp($url)->find('.jbIntroText p')->text();
}
print_r($content);




?>

From key 5 of the array onwards, all the values are empty. (I couldn't upload the image from either my laptop or the web, so here is a link to a screenshot of cmd.exe.) http://img546.imageshack.us/img546/6092/cmdafter5arrayisempty.jpg

I am obviously a beginner, so any suggestions on how to make this code more succinct, or how to better accomplish my scraping prototype, would be appreciated. All constructive criticism is welcome as well :-P


Solution

You might want to add some print statements to at least one of those foreach loops. Several things could be going on here. The two most likely are:

  • The filter may only be matching five items.
  • The HTML parser may be choking on some markup. In this case, it will attempt to load as much of the HTML DOM as it can.

By adding in some print statements, you might be able to see how many times it is iterating.
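
As a rough illustration of both checks, here is a minimal sketch that counts matching links and reports parser warnings for one page. It deliberately uses PHP's built-in DOM extension instead of QueryPath so that it runs on its own, which means the parser behaviour may differ from what htmlqp does. The helper name inspectHtml, the sample markup, and the XPath query (an approximation of the 'ol>li a[href]' part of the selector, without '.jbReadon') are all assumptions made for this example.

```php
<?php
// Hypothetical helper: counts XPath matches in an HTML string and counts
// the warnings libxml raised while parsing it.
function inspectHtml($html, $xpathQuery)
{
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $warnings = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($xpathQuery);

    return array(
        'matches'  => $nodes === false ? 0 : $nodes->length,
        'warnings' => count($warnings),
    );
}

// Made-up sample markup standing in for one blog listing page.
$sample = '<html><body><ol>'
        . '<li><a href="/blog/a.html">A</a></li>'
        . '<li><a href="/blog/b.html">B</a></li>'
        . '<li><a>no href</a></li>'
        . '</ol></body></html>';

$report = inspectHtml($sample, '//ol/li//a[@href]');
printf("matches: %d, parser warnings: %d\n", $report['matches'], $report['warnings']);
```

If the match count for a real page is lower than expected, the selector is the likely culprit; a large number of warnings points towards markup the parser struggled with.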

And as an aside, if you're trying to get the list of articles on your blog, reading the RSS or Atom feed might be easier (though I suppose it might not have all the info you need).
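
To give an idea of the feed approach, here is a minimal sketch that pulls the article links out of an RSS 2.0 document with SimpleXML. The helper name rssItemLinks and the sample feed are made up for illustration; a real feed would be loaded from whatever URL the blog exposes, and an Atom feed uses a different element layout.

```php
<?php
// Hypothetical helper: returns the <link> of every <item> in an RSS 2.0 document.
function rssItemLinks($rssXml)
{
    $feed = simplexml_load_string($rssXml);
    if ($feed === false) {
        return array(); // the input was not well-formed XML
    }

    $links = array();
    foreach ($feed->channel->item as $item) {
        $links[] = (string) $item->link;
    }
    return $links;
}

// Made-up feed standing in for the blog's real RSS output.
$sampleFeed = <<<XML
<?xml version="1.0"?>
<rss version="2.0"><channel><title>My blog</title>
<item><title>First</title><link>http://myblog.com/blog/first.html</link></item>
<item><title>Second</title><link>http://myblog.com/blog/second.html</link></item>
</channel></rss>
XML;

print_r(rssItemLinks($sampleFeed));
```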

Other suggestions

I have solved my problem! Apparently, all I needed was a time delay between requests, because my blog was protecting itself against mass scraping. All I had to do was rewrite the second part of the code like this:

$content = array();
$interval = 2; // Pause on every second request...
$wait = 2;     // ...for two seconds.

foreach ($links as $i => $link) {
    $url = "http://myblog.com".$link;

    // Fetch each article once and keep its intro text.
    $content[] = htmlqp($url)->find('.jbIntroText p')->text();

    if ($i > 0 && $i % $interval == 0) {
        sleep($wait);
    }
}
print_r($content);
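
A variation on the same idea, sketched under a few assumptions: pause before every request and retry any URL whose result came back empty, since empty values were the symptom here. The function name fetchAllThrottled, its parameters, and the stand-in fetcher are all hypothetical; the retry step is an addition of this sketch, not something the original fix needed.

```php
<?php
// Hypothetical helper: calls $fetch for each URL, pausing between requests
// and retrying a URL whose result came back empty.
function fetchAllThrottled(array $urls, $fetch, $waitSeconds = 2, $maxAttempts = 3)
{
    $results = array();
    $isFirst = true;

    foreach ($urls as $url) {
        if (!$isFirst) {
            sleep($waitSeconds); // pause before every request except the first
        }
        $isFirst = false;

        $text = '';
        for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
            $text = call_user_func($fetch, $url);
            if ($text !== '' && $text !== null) {
                break; // got content, stop retrying
            }
            if ($attempt < $maxAttempts) {
                sleep($waitSeconds); // wait before the next attempt
            }
        }
        $results[$url] = $text;
    }
    return $results;
}

// Stand-in fetcher so the sketch runs without network access or QueryPath:
// it returns an empty string the first time each URL is requested.
$seen = array();
$fakeFetch = function ($url) use (&$seen) {
    $seen[$url] = isset($seen[$url]) ? $seen[$url] + 1 : 1;
    return $seen[$url] < 2 ? '' : 'intro text for ' . $url;
};

// A wait of 0 seconds keeps the demonstration instant.
$content = fetchAllThrottled(array('/blog/a.html', '/blog/b.html'), $fakeFetch, 0);
print_r($content);
```

In the real script, the fetcher would presumably be a closure that wraps the htmlqp($url)->find('.jbIntroText p')->text() call and is given the full article URLs, with a non-zero wait.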

Thanks to Technosophos for the idea, which comes from the question "What are the known or expected impact of using Php/Querypath crawler on a target web server, and how can it be kept to a minimum?"

Thanks also for the idea of converting the blog I'm about to scrape into an RSS/Atom feed, since a lot of the time blogs don't generate their own RSS feed.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow