Question

I am trying to write a simple crawler method. When I use PHP cURL to fetch the www.yahoo.com page, I get nothing back. How can I fetch the page? My code is below.

public function getWebPage($url, $timeout = 120) {
    $options = array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER         => false,
            CURLOPT_FOLLOWLOCATION => true, 
            CURLOPT_ENCODING       => "",       
            CURLOPT_USERAGENT      => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.19) Gecko/20081216 Ubuntu/8.04 (hardy) Firefox/2.0.0.19",
            CURLOPT_AUTOREFERER    => true, 
            CURLOPT_CONNECTTIMEOUT => $timeout,
            CURLOPT_TIMEOUT        => $timeout,
            CURLOPT_MAXREDIRS      => 10,
    );

    $ch      = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    $err     = curl_errno($ch);
    $errmsg  = curl_error($ch);
    $header  = curl_getinfo($ch);
    curl_close($ch);

    return $content;
}

Solution

yahoo.com is served over HTTPS (SSL/TLS), so add this cURL option to your existing set:

CURLOPT_SSL_VERIFYPEER     => false,

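Also remove (or comment out) the CURLOPT_USERAGENT option. Keep in mind that setting CURLOPT_SSL_VERIFYPEER to false turns off certificate verification, so it is best kept to testing.

The working code (tested):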

<?php

class A
{
    public function getWebPage($url, $timeout = 120) {
        $options = array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER         => false,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_ENCODING       => "",
            //CURLOPT_USERAGENT      => "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.19) Gecko/20081216 Ubuntu/8.04 (hardy) Firefox/2.0.0.19",
            CURLOPT_AUTOREFERER    => true,
            CURLOPT_CONNECTTIMEOUT => $timeout,
            CURLOPT_SSL_VERIFYPEER => false,
            CURLOPT_TIMEOUT        => $timeout,
            CURLOPT_MAXREDIRS      => 10,
        );

        $ch      = curl_init($url);
        curl_setopt_array($ch, $options);
        $content = curl_exec($ch);
        $err     = curl_errno($ch);
        $errmsg  = curl_error($ch);
        $header  = curl_getinfo($ch);
        curl_close($ch);

        return $content;
    }
}

$a = new A;
echo $a->getWebPage('www.yahoo.com');
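
The code above collects curl_errno() and curl_error() but never looks at them, so a failed request still comes back as an empty result. As a rough sketch of how the cause could be surfaced, the function below throws when curl_exec() returns false. The name fetchPageOrFail and the optional $caBundle parameter are hypothetical and not part of the original answer. It also assumes you would rather keep certificate verification on and, if your system has no CA store configured, pass the path to a CA bundle file through CURLOPT_CAINFO.

```php
<?php
// Hypothetical helper (not from the original answer): fetch a URL and
// throw an exception instead of silently returning false on failure.
function fetchPageOrFail(string $url, int $timeout = 120, ?string $caBundle = null): string
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_ENCODING       => "",     // accept any encoding cURL supports
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_SSL_VERIFYPEER => true,   // assumption: keep verification enabled
    );

    // Assumption: the caller may supply a CA bundle path when the
    // system default certificate store is missing or outdated.
    if ($caBundle !== null) {
        $options[CURLOPT_CAINFO] = $caBundle;
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, $options);

    $content = curl_exec($ch);
    $errno   = curl_errno($ch);
    $errmsg  = curl_error($ch);
    // The handle is released when $ch goes out of scope.

    if ($content === false) {
        // Report why the request failed instead of returning nothing.
        throw new RuntimeException("cURL error $errno: $errmsg");
    }

    return $content;
}
```

Calling fetchPageOrFail('https://www.yahoo.com') inside a try/catch block and printing $e->getMessage() should show whether the empty result comes from a certificate problem, a timeout, or something else.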
Licensed under: CC-BY-SA with attribution