How to use PHP to get a web page into a variable

https://stackoverflow.com/questions/692962

22-08-2019

Question

I would like to download a page from the web. This works when I use an ordinary browser such as Firefox, but when I use "file_get_contents" the server refuses, replying that it understands the request but does not allow such downloads.

So, what can I do? I think I have seen scripts (in Perl) that make the script look like a real browser by setting a user agent and cookies, which makes servers think the script is a real web browser.

Does anyone have an idea how this can be done?

Solution

Use cURL.

<?php
        // create curl resource
        $ch = curl_init();

        // set url
        curl_setopt($ch, CURLOPT_URL, "example.com");

        //return the transfer as a string
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);


        // set the UA
        curl_setopt($ch, CURLOPT_USERAGENT, 'My App (http://www.example.com/)');

        // Alternatively, lie, and pretend to be a browser
        // curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');

        // $output contains the output string
        $output = curl_exec($ch);

        // close curl resource to free up system resources
        curl_close($ch);     
?>

(http://uk.php.net/manual/en/curl.examples-basic.php)
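The question also mentions cookies, which the snippet above does not handle. The following is a minimal sketch, assuming the cURL extension is installed; the function names `buildBrowserOptions` and `fetchPage` and the cookie file path are hypothetical and not part of the original answer. It keeps cookies between requests with `CURLOPT_COOKIEJAR` and `CURLOPT_COOKIEFILE`, and reports a failed transfer instead of silently returning `false`.

```php
<?php
// Hypothetical helper: collects the cURL options for a browser-like request.
function buildBrowserOptions(string $url, string $userAgent, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies received are written here
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies stored here are sent back
        CURLOPT_FOLLOWLOCATION => true,        // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,          // give up after 10 seconds
    ];
}

// Hypothetical helper: downloads the page into a string or throws on failure.
function fetchPage(string $url, string $userAgent, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildBrowserOptions($url, $userAgent, $cookieFile));

    $body = curl_exec($ch);
    if ($body === false) {
        $message = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("cURL request failed: $message");
    }

    curl_close($ch);
    return $body;
}
```

A call might look like `$output = fetchPage('http://www.example.com/', 'My App (http://www.example.com/)', '/tmp/cookies.txt');`, where the cookie path is only an example and must be writable by the script.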

Other suggestions

Yes, cURL is pretty good at getting the page content. I use it with classes such as DOMDocument and DOMXPath to grind the content into a usable form.

function __construct($useragent,$url)
    {
        // build a Firefox-like user agent string with the caller's suffix appended
        $this->useragent='Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.'.$useragent;
        $this->url=$url;

        $ch = curl_init();
        // send the full string built above, not the bare constructor argument
        curl_setopt($ch, CURLOPT_USERAGENT, $this->useragent);
        curl_setopt($ch, CURLOPT_URL,$url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $html= curl_exec($ch);
        curl_close($ch);

        // @ suppresses the warnings DOMDocument raises on malformed HTML
        $dom = new DOMDocument();
        @$dom->loadHTML($html);
        $this->xpath = new DOMXPath($dom);
    }
...
public function displayResults($site)
{
    // $this->path is used as in the original; presumably it is set in the omitted part of the class
    $data=$this->path[0]->length;
    for($i=0;$i<$data;$i++)
    {
        $delData=$this->path[0]->item($i);

        //setting the href and title properties
        $urlSite=$delData->getElementsByTagName('a')->item(0)->getAttribute('href');
        $titleSite=$delData->getElementsByTagName('a')->item(0)->nodeValue;

        //setting the saves and additional
        $saves=$delData->getElementsByTagName('span')->item(0)->nodeValue;
        if ($saves==NULL)
        {
            $saves=0;
        }

        //build the array
        $this->newSiteBookmark[$i]['source']='delicious.com';
        $this->newSiteBookmark[$i]['url']=$urlSite;
        $this->newSiteBookmark[$i]['title']=$titleSite;
        $this->newSiteBookmark[$i]['saves']=$saves;
    }
}

The latter is part of a class that scrapes data from delicious.com. Not very legal, though.

This answer takes your comment and Rich's answer into account.

The site is probably checking whether or not you are a real user by looking at the HTTP referer or the User-Agent string. Try setting these on your cURL handle:
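The DOMDocument and DOMXPath step can be tried without any network access. The following is a minimal sketch, assuming the DOM extension is enabled; the function name `extractLinks` and the sample HTML are made up for illustration and are not taken from the class above. It parses an HTML string that has already been downloaded and collects every link with its text.

```php
<?php
// Hypothetical helper: returns the href and text of every <a> element in $html.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them;
    // real-world pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'url'   => $anchor->getAttribute('href'),
            'title' => trim($anchor->textContent),
        ];
    }
    return $links;
}

// Sample input standing in for the string returned by curl_exec().
$sample = '<html><body><p><a href="http://www.example.com/">Example</a></p></body></html>';
print_r(extractLinks($sample));
```

Using `libxml_use_internal_errors()` rather than the `@` operator is a design choice: it hides only the parser's complaints and leaves other errors visible.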

// pretend you came from their site already
curl_setopt($ch, CURLOPT_REFERER, 'http://domainofthesite.com');
// pretend you are Firefox 3.0.6 running on Windows Vista
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6');

Another way to do it (although others have pointed out a better way) is to use PHP's fopen() function, like so:

$handle = fopen("http://www.example.com/", "r");//open specified URL for reading

This is especially useful if cURL is not available.
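Without cURL, a User-Agent and a referer can still be sent by passing a stream context to fopen() or file_get_contents(). The following is a minimal sketch, assuming `allow_url_fopen` is enabled in php.ini; the function name `buildStreamContext` is hypothetical, and the URLs are placeholders.

```php
<?php
// Hypothetical helper: builds an HTTP stream context with a custom
// User-Agent and Referer header.
function buildStreamContext(string $userAgent, string $referer)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Referer: $referer\r\n",
            'timeout'    => 10, // seconds
        ],
    ]);
}

$context = buildStreamContext('My App (http://www.example.com/)', 'http://www.example.com/');

// Either call accepts the context as an extra argument:
// $page   = file_get_contents('http://www.example.com/', false, $context);
// $handle = fopen('http://www.example.com/', 'r', false, $context);
```

The two network calls are left commented out so the snippet runs offline; uncomment one of them to fetch a real page.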

Licensed under: CC-BY-SA with attribution
