How to use PHP to get a web page into a variable

https://stackoverflow.com/questions/692962

22-08-2019

Question

I want to download a page from the web. This is allowed when using a simple browser like Firefox, but when I use "file_get_contents" the server refuses and replies that it understands the command but does not allow such downloads.

So what to do? I think I saw in some Perl scripts a way to make your script look like a real browser by creating a user agent and cookies, which makes the server think your script is a real web browser.

Does anyone have an idea about how this can be done?

Solution

Use cURL.

<?php
        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "example.com");

        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);


        // set the UA
        curl_setopt($ch, CURLOPT_USERAGENT, 'My App (http://www.example.com/)');

        // Alternatively, lie, and pretend to be a browser
        // curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');

        // $output contains the output string, or false if the request failed
        $output = curl_exec($ch);
        if ($output === false) {
            echo 'cURL error: ' . curl_error($ch);
        }

        // close curl resource to free up system resources
        curl_close($ch);
?>

(http://uk.php.net/manual/en/curl.examples-basic.php)
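
The question also mentions cookies, which the snippet above does not set. A minimal sketch of one way to keep them, assuming a writable cookie file: the function name fetch_with_cookies and the $cookieFile parameter are made up for illustration, while CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR are standard cURL options.

```php
<?php
// Hypothetical helper: fetch a page while keeping cookies in $cookieFile.
function fetch_with_cookies(string $url, string $cookieFile): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_USERAGENT      => 'My App (http://www.example.com/)',
        // read previously stored cookies (this also switches on cookie handling)
        CURLOPT_COOKIEFILE     => $cookieFile,
        // store received cookies in the same file when the handle is released
        CURLOPT_COOKIEJAR      => $cookieFile,
    ]);

    $output = curl_exec($ch);
    if ($output === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    return $output;
}
```

Calling it twice with the same cookie file, for example fetch_with_cookies('http://www.example.com/', '/tmp/cookies.txt'), should let the second request send back whatever cookies the first one received. Whether that is enough depends on what the server actually checks.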

Other tips

Yes, cURL is pretty good at getting page content. I use it with classes like DOMDocument and DOMXPath to grind the content into a usable form.

function __construct($useragent,$url)
    {
        // prefix the caller's string with a Firefox-style user agent
        $this->useragent='Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.'.$useragent;
        $this->url=$url;

        $ch = curl_init();
        // send the full string built above, not just the raw argument
        curl_setopt($ch, CURLOPT_USERAGENT, $this->useragent);
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html = curl_exec($ch);
        curl_close($ch);

        // parse the page; @ suppresses warnings about malformed HTML
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $this->xpath = new DOMXPath($dom);
    }
...
public function displayResults($site)
    {
        // $this->path[0] is presumably a node list built in the part of the class not shown
        $data=$this->path[0]->length;
        for($i=0;$i<$data;$i++)
        {
            $delData=$this->path[0]->item($i);

            //setting the href and title properties
            $urlSite=$delData->getElementsByTagName('a')->item(0)->getAttribute('href');
            $titleSite=$delData->getElementsByTagName('a')->item(0)->nodeValue;

            //setting the saves and additional
            $saves=$delData->getElementsByTagName('span')->item(0)->nodeValue;
            if ($saves==NULL)
            {
                $saves=0;
            }

            //build the array
            $this->newSiteBookmark[$i]['source']='delicious.com';
            $this->newSiteBookmark[$i]['url']=$urlSite;
            $this->newSiteBookmark[$i]['title']=$titleSite;
            $this->newSiteBookmark[$i]['saves']=$saves;
        }
    }

The last bit is part of a class that scrapes data from delicious.com. Not very legal, though.
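
Since the class above is only a fragment, here is a small self-contained sketch of the same DOMDocument and DOMXPath idea. It runs on a hard-coded HTML string instead of a live download, and the function name extract_links and the sample markup are invented for illustration.

```php
<?php
// Hypothetical helper: pull link targets and link text out of an HTML string.
function extract_links(string $html): array
{
    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];

    // Select every <a> element that carries an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'url'   => $anchor->getAttribute('href'),
            'title' => trim($anchor->textContent),
        ];
    }

    return $links;
}

// Sample markup standing in for the string returned by curl_exec().
$html = '<html><body><div class="post"><a href="http://www.example.com/">Example</a></div></body></html>';
print_r(extract_links($html));
```

Using libxml_use_internal_errors() instead of the @ operator keeps other errors visible while still silencing the HTML parser's warnings.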

This answer takes your comment on Rich's answer into account.

The site is probably checking whether or not you are a real user by looking at the HTTP referer or the User-Agent string. Try setting these on your cURL handle:

// pretend you came from their site already
curl_setopt($ch, CURLOPT_REFERER, 'http://domainofthesite.com');
// pretend you are Firefox 3.0.6 running on Windows Vista
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6');
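
The two settings can also be bundled with curl_setopt_array() so that every request uses the same values. This is only a sketch: browser_like_options and fetch_like_browser are hypothetical names, and nothing guarantees that a given site checks only these two headers.

```php
<?php
// Hypothetical helper: collect the browser-like settings in one options array.
function browser_like_options(string $referer, string $userAgent): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,       // return the page instead of printing it
        CURLOPT_REFERER        => $referer,   // pretend we came from this page
        CURLOPT_USERAGENT      => $userAgent, // pretend to be this browser
    ];
}

// Hypothetical helper: fetch $url with those settings.
// Returns the response body as a string, or false if the request failed.
function fetch_like_browser(string $url, string $referer, string $userAgent)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, browser_like_options($referer, $userAgent));

    return curl_exec($ch);
}
```

A call would look like fetch_like_browser('http://domainofthesite.com/page', 'http://domainofthesite.com', $userAgentString), where the page path and $userAgentString are placeholders.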

Another way to do it (though others have pointed out a better way) is to use PHP's fopen() function, like so:

$handle = fopen("http://www.example.com/", "r");//open specified URL for reading

It is especially useful if cURL isn't available.
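
Without cURL, the User-Agent and Referer can still be set through a stream context, which both fopen() and file_get_contents() accept. A minimal sketch, assuming allow_url_fopen is enabled; the function name browser_stream_context is made up for illustration.

```php
<?php
// Hypothetical helper: build a stream context that sends a User-Agent and a Referer.
function browser_stream_context(string $userAgent, string $referer)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Referer: $referer\r\n",
            'timeout'    => 10, // seconds
        ],
    ]);
}

$context = browser_stream_context('My App (http://www.example.com/)', 'http://www.example.com/');

// Usage (performs a real HTTP request, so it is left commented out here):
// $html   = file_get_contents('http://www.example.com/', false, $context);
// $handle = fopen('http://www.example.com/', 'r', false, $context);
```

This addresses the original file_get_contents problem directly, provided the server rejected the request only because of the missing headers.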

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow