How to extract img src, title and alt from HTML using PHP?

https://stackoverflow.com/questions/138313

02-07-2019

Question

I would like to create a page where all the images that reside on my website are listed with their title and alternative representation.

I already wrote a little program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML:

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

I guess this should be done with some regular expression, but since the order of the attributes may vary and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard way, char by char, but that's painful).

Solution

EDIT: now that I know better

Using regexp to solve this kind of problem is a bad idea and will likely lead to unreliable, unmaintainable code. Better use an HTML parser.

Solution with regexp

In that case, it's better to split the process into two parts:

get all the img tags
extract their metadata

I will assume your document is not strict xHTML, so you can't use an XML parser. E.g. with the source code of this web page:

/* preg_match_all matches the regexp against the whole $html string and stores
every match as an array in $result. The "i" option makes it case insensitive */

preg_match_all('/<img[^>]+>/i',$html, $result); 

print_r($result);
Array
(
    [0] => Array
        (
            [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
            [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
            [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
            [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />

[...]
        )

)

Then we get all the img tag attributes with a loop:
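A minimal sketch of what such a loop could look like, assuming the two regexps explained further down and iterating over $result[0] (a later answer in this thread points out that preg_match_all returns an array of arrays). The sample $html string is hypothetical and only there to make the snippet runnable on its own:

```php
<?php
// Hypothetical sample input; in the answer, $html holds the fetched page source.
$html = '<p><img src="/a.png" alt="first" title="A"><img alt="second" src="/b.png"></p>';

// Step 1: collect every <img ...> tag.
preg_match_all('/<img[^>]+>/i', $html, $result);

// Step 2: for each tag, collect its alt/title/src attributes.
// $result[0] holds the full matches, so that is what we iterate over.
$img = array();
foreach ($result[0] as $img_tag) {
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}

print_r($img);
```

For each tag, index 1 of the inner array holds the attribute names and index 2 holds the values, still wrapped in their double quotes.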

Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading / saving from a text file.
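One way the ob_start idea could be sketched, as an illustration rather than a finished cache. The file name, the one-hour lifetime and the render_image_list() function are all assumptions made up for this example:

```php
<?php
// Hypothetical cache location and lifetime; adjust to taste.
$cacheFile = sys_get_temp_dir() . '/image_list_cache.txt';
$maxAge = 3600; // seconds

// Hypothetical stand-in for the expensive regexp-driven page generation.
function render_image_list()
{
    echo "<ul><li>/image/fluffybunny.jpg</li></ul>";
}

if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAge) {
    // Cache hit: serve the saved copy.
    $page = file_get_contents($cacheFile);
} else {
    // Cache miss: capture the generated output and save it for next time.
    ob_start();
    render_image_list();
    $page = ob_get_clean();
    file_put_contents($cacheFile, $page);
}

echo $page;
```

The regexps then only run when the text file is missing or older than the chosen lifetime.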

How does this work?

First, we use preg_match_all, a function that finds every string matching the pattern and outputs the matches in its third parameter.

The regexps:

<img[^>]+>

We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains one or more characters other than ">", and ends with a ">".

(alt|title|src)=("[^"]*")

We apply it successively to each img tag. It can be read as: every string that starts with "alt", "title" or "src", then a "=", then a '"', a bunch of stuff that is not '"', and ends with a '"'. The sub-strings are isolated between ().

Finally, every time you want to deal with regexps, it is handy to have good tools to test them quickly. Check this online regexp tester.

EDIT: answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace every " with '.

If you mix both, you should first slap yourself :-), then try to use ("|') in place of " and [^ø] to replace [^"].
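As a hedged illustration of the mixed-quote case, here is one possible pattern. It is not the exact substitution suggested above: it uses an alternation so that each value must close with the same kind of quote that opened it. The sample tag is hypothetical:

```php
<?php
// Hypothetical tag mixing both quote styles.
$img_tag = '<img src=\'/image/fluffybunny.jpg\' title="Harvey the bunny" alt=\'a "cute" bunny\'>';

// Each value is either "..." (no double quote inside) or '...' (no single quote inside).
preg_match_all('/(alt|title|src)=("[^"]*"|\'[^\']*\')/i', $img_tag, $matches, PREG_SET_ORDER);

$attributes = array();
foreach ($matches as $m) {
    // $m[1] is the attribute name, $m[2] the value including its quotes.
    $attributes[strtolower($m[1])] = substr($m[2], 1, -1);
}

print_r($attributes);
```

Unquoted values (src=foo) are still not covered, which is one more reason to prefer a parser.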

Other answers

$url="http://example.com";

$html = file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);

$tags = $doc->getElementsByTagName('img');

foreach ($tags as $tag) {
       echo $tag->getAttribute('src');
}

Just to give a small example of using PHP's XML functionality for the task:

$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
    echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}

I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes using xpath and the xpath results simpler.
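For comparison, a minimal sketch of the same lookup without the SimpleXMLElement conversion, querying the DOM directly through DOMXPath. The $lines array is only an illustrative name used to collect the results:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><img src="myimage.jpg" title="title" alt="alt"></body></html>');

// Query the DOM directly instead of converting to SimpleXMLElement.
$xpath = new DOMXPath($doc);
$lines = array();
foreach ($xpath->query('//img') as $img) {
    $lines[] = $img->getAttribute('src') . ' ' . $img->getAttribute('alt') . ' ' . $img->getAttribute('title');
}

echo implode("\n", $lines), "\n";
```

The trade-off is a slightly more verbose attribute access (getAttribute() instead of array syntax) in exchange for skipping one conversion step.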

If it's XHTML, as your example is, you only need simpleXML.

<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>

Output:

object(SimpleXMLElement)#1 (1) {
  ["@attributes"]=>
  array(3) {
    ["src"]=>
    string(22) "/image/fluffybunny.jpg"
    ["title"]=>
    string(16) "Harvey the bunny"
    ["alt"]=>
    string(26) "a cute little fluffy bunny"
  }
}

The script must be edited like this

foreach( $result[0] as $img_tag)

because preg_match_all returns an array of arrays

You can use simplehtmldom. Most jQuery selectors are supported in simplehtmldom. An example is given below

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';

I used preg_match to do it.

In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from Wordpress, and I was trying to get the src attribute so I could run it through timthumb.

// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);

// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);

To grab the title or the alt instead, you could simply use $pattern = '/title="([^"]*)"/'; for the title or $pattern = '/alt="([^"]*)"/'; for the alt. Sadly, my regex isn't good enough to grab all three (alt / title / src) in a single pass.
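A possible single-pass variant, sketched under the same assumption as above (double-quoted attribute values only). The $image string here is a hypothetical stand-in for whatever get_the_post_thumbnail() returns:

```php
<?php
// Hypothetical stand-in for the markup returned by get_the_post_thumbnail().
$image = '<img width="150" height="150" src="/uploads/bunny.jpg" alt="a cute bunny" title="Harvey">';

// The leading \s keeps names such as data-src from matching.
// PREG_SET_ORDER yields one array per attribute: [full match, name, value].
preg_match_all('/\s(alt|title|src)="([^"]*)"/i', $image, $matches, PREG_SET_ORDER);

// Default to empty strings so a missing attribute does not cause a notice later.
$attrs = array('src' => '', 'title' => '', 'alt' => '');
foreach ($matches as $m) {
    $attrs[strtolower($m[1])] = $m[2];
}

print_r($attrs);
```

Because the attribute name is captured along with the value, the order in which the attributes appear in the tag no longer matters.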

Here is a PHP function I cobbled together from all of the above info for a similar purpose, namely adjusting the width and height properties of image tags on the fly... a bit clunky, perhaps, but it seems to work dependably:

function ReSizeImagesInHTML($HTMLContent,$MaximumWidth,$MaximumHeight) {

// find image tags
preg_match_all('/<img[^>]+>/i',$HTMLContent, $rawimagearray,PREG_SET_ORDER); 

// put image tags in a simpler array
$imagearray = array();
for ($i = 0; $i < count($rawimagearray); $i++) {
    array_push($imagearray, $rawimagearray[$i][0]);
}

// put image attributes in another array
$imageinfo = array();
foreach($imagearray as $img_tag) {

    preg_match_all('/(src|width|height)=("[^"]*")/i',$img_tag, $imageinfo[$img_tag]);
}

// combine everything into one array
$AllImageInfo = array();
foreach($imagearray as $img_tag) {

    $ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
    $OrignialWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
    $OrignialHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);

    $NewWidth = $OrignialWidth; 
    $NewHeight = $OrignialHeight;
    $AdjustDimensions = "F";

    if($OrignialWidth > $MaximumWidth) { 
        $diff = $OrignialWidth-$MaximumWidth; 
        $percnt_reduced = (($diff/$OrignialWidth)*100); 
        $NewHeight = floor($OrignialHeight-(($percnt_reduced*$OrignialHeight)/100)); 
        $NewWidth = floor($OrignialWidth-$diff); 
        $AdjustDimensions = "T";
    }

    if($OrignialHeight > $MaximumHeight) { 
        $diff = $OrignialHeight-$MaximumHeight; 
        $percnt_reduced = (($diff/$OrignialHeight)*100); 
        $NewWidth = floor($OrignialWidth-(($percnt_reduced*$OrignialWidth)/100)); 
        $NewHeight= floor($OrignialHeight-$diff); 
        $AdjustDimensions = "T";
    } 

    $thisImageInfo = array('OriginalImageTag' => $img_tag , 'ImageSource' => $ImageSource , 'OrignialWidth' => $OrignialWidth , 'OrignialHeight' => $OrignialHeight , 'NewWidth' => $NewWidth , 'NewHeight' => $NewHeight, 'AdjustDimensions' => $AdjustDimensions);
    array_push($AllImageInfo, $thisImageInfo);
}

// build array of before and after tags
$ImageBeforeAndAfter = array();
for ($i = 0; $i < count($AllImageInfo); $i++) {

    if($AllImageInfo[$i]['AdjustDimensions'] == "T") {
        $NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OrignialWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
        $NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OrignialHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);

        $thisImageBeforeAndAfter = array('OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'] , 'NewImageTag' => $NewImageTag);
        array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
    }
}

// execute search and replace
for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
    $HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'],$ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
}

return $HTMLContent;

}

Here's THE solution, in PHP:

Just download QueryPath, and then do as follows:

$doc= qp($myHtmlDoc);

foreach($doc->xpath('//img') as $img) {

   $src= $img->attr('src');
   $title= $img->attr('title');
   $alt= $img->attr('alt');

}

That's it, you're done!

I have read the many comments on this page complaining that using a dom parser is unnecessary overhead. Well, it may be more expensive than a mere regex call, but the OP has stated that there is no control over the order of the attributes in the img tags. This fact leads to unnecessary regex pattern convolution. Beyond that, using a dom parser provides the additional benefits of readability, maintainability, and dom-awareness (regex is not dom-aware).

I love regex and I answer lots of regex questions, but when dealing with valid HTML there is seldom a good reason to choose regex over a parser.

In the demonstration below, see how easily and cleanly DOMDocument handles img tag attributes in any order with a mixture of quoting (and no quoting at all). Also notice that tags without a targeted attribute are not disruptive at all: an empty string is provided as the value.

Code: (Demo)

$test = <<<HTML
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
<img src='/image/pricklycactus.jpg' title='Roger the cactus' alt='a big green prickly cactus' />
<p>This is irrelevant text.</p>
<img alt="an annoying white cockatoo" title="Polly the cockatoo" src="/image/noisycockatoo.jpg">
<img title=something src=somethingelse>
HTML;

libxml_use_internal_errors(true);  // silences/forgives complaints from the parser (remove to see what is generated)
$dom = new DOMDocument();
$dom->loadHTML($test);
foreach ($dom->getElementsByTagName('img') as $i => $img) {
    echo "IMG#{$i}:\n";
    echo "\tsrc = " , $img->getAttribute('src') , "\n";
    echo "\ttitle = " , $img->getAttribute('title') , "\n";
    echo "\talt = " , $img->getAttribute('alt') , "\n";
    echo "---\n";
}

Output:

IMG#0:
    src = /image/fluffybunny.jpg
    title = Harvey the bunny
    alt = a cute little fluffy bunny
---
IMG#1:
    src = /image/pricklycactus.jpg
    title = Roger the cactus
    alt = a big green prickly cactus
---
IMG#2:
    src = /image/noisycockatoo.jpg
    title = Polly the cockatoo
    alt = an annoying white cockatoo
---
IMG#3:
    src = somethingelse
    title = something
    alt = 
---

Using this technique in professional code will leave you with a clean script, fewer hiccups to contend with, and fewer colleagues who wish you worked somewhere else.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow