HTML scraping and CSS queries

https://stackoverflow.com/questions/3603511

25-09-2019

Question

What are the advantages and disadvantages of the following libraries?

Of the above, I have used QP, which failed to parse invalid HTML, and simpleDomParser, which does a good job but leaks memory because of its object model. You can keep that under control by calling $object->clear(); unset($object); when you no longer need an object.

Are there any more scrapers? What are your experiences with them? I am going to make this a community wiki, so maybe we can build a useful list of libraries for scraping.

I ran some tests based on Byron's answer:

    <?php
    include("lib/simplehtmldom/simple_html_dom.php");
    include("lib/phpQuery/phpQuery/phpQuery.php");

    echo "<pre>";

    $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
    $data['pq'] = $data['dom'] = $data['simple_dom'] = array();

    $timer_start = microtime(true);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $x = new DOMXPath($dom);

    foreach($x->query("//a") as $node)
    {
         $data['dom'][] = $node->getAttribute("href");
    }

    foreach($x->query("//img") as $node)
    {
         $data['dom'][] = $node->getAttribute("src");
    }

    foreach($x->query("//input") as $node)
    {
         $data['dom'][] = $node->getAttribute("name");
    }

    $dom_time =  microtime(true) - $timer_start;
    echo "dom: \t\t $dom_time . Got ".count($data['dom'])." items \n";

    // phpQuery: the same three queries, written as CSS selectors
    $timer_start = microtime(true);
    $doc = phpQuery::newDocument($html);
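    // Note: iterating a phpQuery result yields plain DOMElement nodes, so
    // $node->href is probably empty here; $node->getAttribute("href") would
    // read the attribute. The loops are left as they were benchmarked.
    foreach( $doc->find("a") as $node)
    {
       $data['pq'][] = $node->href;
    }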

    foreach( $doc->find("img") as $node)
    {
       $data['pq'][] = $node->src;
    }

    foreach( $doc->find("input") as $node)
    {
       $data['pq'][] = $node->name;
    }
    $time =  microtime(true) - $timer_start;
    echo "PQ: \t\t $time . Got ".count($data['pq'])." items \n";

    // simple_html_dom: the same three queries again
    $timer_start = microtime(true);
    $simple_dom = new simple_html_dom();
    $simple_dom->load($html);
    foreach( $simple_dom->find("a") as $node)
    {
       $data['simple_dom'][] = $node->href;
    }

    foreach( $simple_dom->find("img") as $node)
    {
       $data['simple_dom'][] = $node->src;
    }

    foreach( $simple_dom->find("input") as $node)
    {
       $data['simple_dom'][] = $node->name;
    }
    $simple_dom_time =  microtime(true) - $timer_start;
    echo "simple_dom: \t $simple_dom_time . Got ".count($data['simple_dom'])." items \n";

    echo "</pre>";

and got:

    dom:         0.00359296798706 . Got 115 items
    PQ:          0.010568857193 . Got 115 items
    simple_dom:  0.0770139694214 . Got 115 items

Solution

I used to use simple html dom exclusively, until some bright SO'ers showed me the light. Hallelujah.

Just use the built-in DOM functions. They are written in C and bundled with PHP. They are faster and more efficient than any third-party solution. With Firebug, getting an XPath query is very easy. This simple change has made my PHP-based scrapers run faster, while saving me precious time.
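For illustration, here is a minimal sketch of that approach on deliberately sloppy markup. The sample HTML and the variable names are made up for this example, and it uses libxml_use_internal_errors() in place of the @ operator to keep parser warnings quiet; treat it as one way to do it, not the only one.

```php
<?php
// Hypothetical sample markup: the <li> and <p> elements are never closed,
// standing in for the kind of invalid HTML a scraper runs into.
$html = '<ul><li><a href="/first">one</a><li><a href="/second">two</a></ul>'
      . '<p>Unclosed paragraph<img src="logo.png"><input name="q">';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Element query: read the attribute from each matching node.
$links = [];
foreach ($xpath->query('//a[@href]') as $node) {
    $links[] = $node->getAttribute('href');
}

// Attribute query: XPath can also select the attribute nodes directly.
$images = [];
foreach ($xpath->query('//img/@src') as $attr) {
    $images[] = $attr->nodeValue;
}

print_r($links);
print_r($images);
```

The same pattern extends to any XPath expression copied out of a browser inspector.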

My scrapers used to take ~60 megabytes to scrape 10 sites asynchronously with curl. That was even with the simple html dom memory fix you mentioned.

Now my PHP processes never go above 8 megabytes.
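If you want to check memory use in your own scraper, a rough sketch along these lines may help. The fakePage() helper and the page sizes are invented for the example (a real scraper would fetch pages with curl), and memory_get_peak_usage() reports only what PHP's own allocator handed out, so memory held by the underlying XML library may not be fully reflected in the figure.

```php
<?php
// Hypothetical stand-in for downloaded pages: markup generated locally.
function fakePage(int $links): string
{
    $html = '<ul>';
    for ($i = 0; $i < $links; $i++) {
        $html .= "<li><a href=\"/item/$i\">item $i</a></li>";
    }
    return $html . '</ul>';
}

libxml_use_internal_errors(true);

$total = 0;
foreach ([500, 1000, 1500] as $size) {
    $dom = new DOMDocument();
    $dom->loadHTML(fakePage($size));
    $xpath = new DOMXPath($dom);
    $total += $xpath->query('//a')->length;
    libxml_clear_errors();
    // Drop the references so the parsed tree can be freed before the next page.
    unset($xpath, $dom);
}

$peakMb = memory_get_peak_usage(true) / (1024 * 1024);
printf("links: %d, peak: %.2f MB\n", $total, $peakMb);
```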

Highly recommended.

EDIT

OK, I ran some benchmarks. The built-in DOM is at least an order of magnitude faster (about 17x in this run).

    Built in php DOM: 0.007061
    Simple html  DOM: 0.117781

    <?php
    include("../lib/simple_html_dom.php");

    $html = file_get_contents("http://stackoverflow.com/search?q=favorite+programmer+cartoon");
    $data['dom'] = $data['simple_dom'] = array();

    // Built-in DOM: parse once, then run three XPath queries
    $timer_start = microtime(true);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);   // @ silences warnings about invalid markup
    $x = new DOMXPath($dom);

    foreach($x->query("//a") as $node)
    {
         $data['dom'][] = $node->getAttribute("href");
    }

    foreach($x->query("//img") as $node)
    {
         $data['dom'][] = $node->getAttribute("src");
    }

    foreach($x->query("//input") as $node)
    {
         $data['dom'][] = $node->getAttribute("name");
    }

    $dom_time =  microtime(true) - $timer_start;

    echo "built in php DOM : $dom_time\n";

    // simple_html_dom: the same three queries via CSS selectors
    $timer_start = microtime(true);
    $simple_dom = new simple_html_dom();
    $simple_dom->load($html);
    foreach( $simple_dom->find("a") as $node)
    {
       $data['simple_dom'][] = $node->href;
    }

    foreach( $simple_dom->find("img") as $node)
    {
       $data['simple_dom'][] = $node->src;
    }

    foreach( $simple_dom->find("input") as $node)
    {
       $data['simple_dom'][] = $node->name;
    }
    $simple_dom_time =  microtime(true) - $timer_start;

    echo "simple html  DOM : $simple_dom_time\n";

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow