Question

I'd like to know how to scrape, in a loop (page 1, page 2, etc.), a webpage that uses infinite scrolling (like imgur, for example).

I tried the code below, but it returns only the first page. How can I trigger the next page, given that the site uses an infinite-scrolling template?

<?php
function curl_exec_follow($ch, &$maxredirect = null) {
    $mr = $maxredirect === null ? 10 : intval($maxredirect);
    if (ini_get('open_basedir') == '' && ini_get('safe_mode') == 'Off') {
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $mr > 0);
        curl_setopt($ch, CURLOPT_MAXREDIRS, $mr);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    } else {
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

        if ($mr > 0) {
            $original_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
            $newurl = $original_url;
            $rch = curl_copy_handle($ch);

            curl_setopt($rch, CURLOPT_HEADER, true);
            curl_setopt($rch, CURLOPT_NOBODY, true);
            curl_setopt($rch, CURLOPT_FORBID_REUSE, false);
            do {
                curl_setopt($rch, CURLOPT_URL, $newurl);
                $header = curl_exec($rch);
                if (curl_errno($rch)) {
                    $code = 0;
                } else {
                    $code = curl_getinfo($rch, CURLINFO_HTTP_CODE);
                    if ($code == 301 || $code == 302) {
                        preg_match('/Location:(.*?)\n/', $header, $matches);
                        $newurl = trim(array_pop($matches));

                        // if no scheme is present then the new url is a
                        // relative path and thus needs some extra care
                        if(!preg_match("/^https?:/i", $newurl)){
                            $newurl = $original_url . $newurl;
                        }
                    } else {
                        $code = 0;
                    }
                }
            } while ($code && --$mr);
            curl_close($rch);
            if (!$mr) {
                if ($maxredirect === null)
                    trigger_error('Too many redirects.', E_USER_WARNING);
                else
                    $maxredirect = 0;
                return false;
            }
            curl_setopt($ch, CURLOPT_URL, $newurl);
        }
    }
    return curl_exec($ch);
}

$ch = curl_init('http://www.imgur.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec_follow($ch);
curl_close($ch);

echo $data;
?>

Solution

cURL fetches only the source that the server returns for a URL; it does not run JavaScript. Your code will therefore gather the HTML of the original page only. In the case of imgur, that includes ~40 images, plus the rest of the page layout.

This original source code doesn't change when you scroll down. However, the HTML inside your browser does. This is done with AJAX: the page you are looking at requests more content from a second URL.

If you use Firebug (for Firefox) or Google Chrome's page inspector, you can monitor these requests in the Net or Network tab, respectively. When you scroll down, the page makes roughly 45 more requests (mostly for images). You'll also see that it requests this page:

http://imgur.com/gallery/hot/viral/day/page/0?scrolled&set=1

The JavaScript on the imgur homepage appends the HTML returned by this URL to the bottom of the home page. If you want a list of images, you would probably want to query this URL (or the API, as the other Chris said). You can play with the numbers at the end of the URL to get more images.
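
As a rough sketch of that idea, the loop below requests the scroll URL with an increasing page number. It assumes that only the number after /page/ needs to change and that an empty response means there is nothing more to load; neither is confirmed, and the URL pattern may have changed since. The names imgur_page_url, scrape_pages and curl_fetch are made up for this example.

```php
<?php
// Hypothetical helper: builds the URL for one "page" of the infinite-scroll feed.
// The pattern is the one seen in the Network tab; treat it as an assumption.
function imgur_page_url(int $page): string
{
    return "http://imgur.com/gallery/hot/viral/day/page/{$page}?scrolled&set=1";
}

// Requests pages 0 .. $maxPages - 1 and returns the HTML fragments keyed by page number.
// $fetch is any callable that takes a URL and returns the body, or false on failure.
function scrape_pages(int $maxPages, callable $fetch): array
{
    $fragments = [];
    for ($page = 0; $page < $maxPages; $page++) {
        $html = $fetch(imgur_page_url($page));
        // Stop on a failed or empty response (assumed to mean the end of the feed).
        if ($html === false || trim($html) === '') {
            break;
        }
        $fragments[$page] = $html;
    }
    return $fragments;
}

// Plain cURL fetcher: returns the response body, or false on error.
function curl_fetch(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}
```

Calling scrape_pages(3, 'curl_fetch') would then request pages 0, 1 and 2. Passing the fetcher in as a callable keeps the loop separate from the HTTP code, so it can be tried out with a stub before hitting the real site.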

OTHER TIPS

Page scraping is seldom the best approach, for reasons exactly like this. Imgur offers an API that accomplishes the tasks I assume you're attempting, without any hacky scraping.

If you're married to the idea of scraping, you'll have to do some research. Instead of scraping only the main page, you'll need to note the URL used by the AJAX request; you can then call it directly and continue to scrape subsequent pages of data. The specifics of this approach are beyond the scope of this answer, especially considering that there is an established API available.
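
For the scraping route, each response still has to be turned into data. The sketch below pulls image URLs out of an HTML fragment with PHP's DOM extension. It assumes the images appear as ordinary img tags with a src attribute, which has not been checked against imgur's actual markup; extract_image_urls is a name invented for this example.

```php
<?php
// Returns the src attribute of every <img> element found in an HTML fragment.
function extract_image_urls(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Scraped markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Skip <img> tags that have no src at all.
        if ($src !== '') {
            $urls[] = $src;
        }
    }
    return $urls;
}
```

Run over the body of each AJAX response, this yields one flat list of image URLs per page, which can be merged across pages.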

Licensed under: CC-BY-SA with attribution