How to extract img src, title and alt from HTML using PHP?

https://stackoverflow.com/questions/138313

02-07-2019

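Question

I would like to create a page that lists all the images residing on my website, together with their title and alternative text.

I have already written a little program to find and load all the HTML files, but now I am stuck on how to extract src, title and alt from this HTML:

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

I guess this should be done with some regex, but since the order of the attributes may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard way, char by char, but that's painful).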

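Solution

EDIT: now that I know better

Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better use an HTML parser.

Solution with regexp

In that case it's better to split the process into two parts:

get all the img tags
extract their metadata

I will assume your document is not strict XHTML, so you can't use an XML parser. E.g. with this web page source code: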

/* preg_match_all match the regexp in all the $html string and output everything as 
an array in $result. "i" option is used to make it case insensitive */

preg_match_all('/<img[^>]+>/i',$html, $result); 

print_r($result);
Array
(
    [0] => Array
        (
            [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
            [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
            [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
            [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />

[...]
        )

)

Then we get all the img tag attributes with a loop:
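A minimal sketch of such a loop, using the attribute pattern (alt|title|src)=("[^"]*") and iterating over $result[0], since preg_match_all returns an array of arrays. The sample $html string is an assumption made only so the snippet runs on its own:

```php
<?php
// Sample input standing in for the fetched page (an assumption for this sketch).
$html = '<p><img src="/a.png" alt="first" title="A"> <img title="B" src="/b.png"></p>';

// Step 1: collect every img tag.
preg_match_all('/<img[^>]+>/i', $html, $result);

// Step 2: for each tag, collect the attributes we care about.
$img = array();
// $result[0] holds the full matches, i.e. the img tags themselves.
foreach ($result[0] as $img_tag) {
    // Index 1 of each entry lists attribute names, index 2 the quoted values.
    preg_match_all('/(alt|title|src)=("[^"]*")/i', $img_tag, $img[$img_tag]);
}

print_r($img);
```

Each entry of $img is keyed by the tag's own markup; the values still carry their surrounding double quotes, so strip them (for example with trim($value, '"')) before use.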

Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading / saving from a text file.
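One way such a home-made cache could look, as a rough sketch only. The cache file name, the one-hour lifetime and the placeholder list markup are all assumptions, not part of the original answer:

```php
<?php
// Hypothetical cache location and lifetime; adjust both to your setup.
$cacheFile = sys_get_temp_dir() . '/image-list.cache.html';
$cacheTtl  = 3600; // seconds

if (is_file($cacheFile) && time() - filemtime($cacheFile) < $cacheTtl) {
    // A fresh copy is on disk: reuse it and skip the regexp work.
    $page = file_get_contents($cacheFile);
} else {
    ob_start();                 // capture everything echoed below
    echo "<ul>\n";
    echo "<li>expensive image listing goes here</li>\n"; // placeholder output
    echo "</ul>\n";
    $page = ob_get_clean();     // stop buffering and fetch the captured output
    file_put_contents($cacheFile, $page);
}

echo $page;
```

The regexp pass then only runs when the cached copy is missing or older than the chosen lifetime.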

How does this stuff work?

First, we use preg_match_all, a function that gets every string matching the pattern and outputs it in its third parameter.

The regexps:

<img[^>]+>

We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains non-">" chars, and ends with a ">".

(alt|title|src)=("[^"]*")

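We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a '"', a bunch of characters that are not '"', and ending with a '"'. The sub-strings between () are isolated as capture groups.

Finally, every time you want to deal with regexps, it is handy to have good tools to test them quickly. Check this online regexp tester.

EDIT: answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace every " with '.

If you mix both, you should first slap yourself :-), then try to use ("|') in place of " and [^o] in place of [^"].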
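For the mixed-quote case, one possible pattern is sketched below. It relies on a backreference (\2) so that the closing quote has to be the same character as the opening one, which is a technique the answer itself does not mention; the sample tag is made up for illustration:

```php
<?php
// Made-up sample tag mixing single and double quotes.
$img_tag = '<img src=\'/image/pricklycactus.jpg\' title="Roger the cactus" alt=\'a big green prickly cactus\' />';

// Group 1: attribute name. Group 2: opening quote. Group 3: value.
// \2 requires the closing quote to match the opening quote.
preg_match_all('/(alt|title|src)=(["\'])(.*?)\2/i', $img_tag, $matches, PREG_SET_ORDER);

$attributes = array();
foreach ($matches as $match) {
    $attributes[strtolower($match[1])] = $match[3];
}

print_r($attributes);
```

Unquoted attribute values are still not handled by this pattern, which is one more reason to prefer an HTML parser.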

Other tips

$url="http://example.com";

$html = file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);

$tags = $doc->getElementsByTagName('img');

foreach ($tags as $tag) {
       echo $tag->getAttribute('src');
}

Just to give a small example of using PHP's XML functionality for the task:

$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
    echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}

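I did use the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes using xpath and handling the xpath results simpler.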
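For comparison, a sketch of the same query without the SimpleXML conversion, using DOMXPath directly on the DOMDocument. The $lines array is only there to collect the output for this illustration:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><img src="myimage.jpg" title="title" alt="alt"></body></html>');

// DOMXPath evaluates XPath expressions against the DOMDocument itself.
$xpath = new DOMXPath($doc);

$lines = array();
foreach ($xpath->query('//img') as $img) {
    // getAttribute() returns an empty string when the attribute is absent.
    $lines[] = $img->getAttribute('src') . ' ' . $img->getAttribute('alt') . ' ' . $img->getAttribute('title');
}

echo implode("\n", $lines), "\n";
```

The trade-off is slightly more verbose attribute access (getAttribute() instead of array syntax) in exchange for one less conversion step.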

If it's XHTML, as your example is, you only need SimpleXML.

<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>

Output:

object(SimpleXMLElement)#1 (1) {
  ["@attributes"]=>
  array(3) {
    ["src"]=>
    string(22) "/image/fluffybunny.jpg"
    ["title"]=>
    string(16) "Harvey the bunny"
    ["alt"]=>
    string(26) "a cute little fluffy bunny"
  }
}

The script must be edited like this

foreach( $result[0] as $img_tag)

because preg_match_all returns an array of arrays

You may use simplehtmldom. Most of the jQuery selectors are supported in simplehtmldom. An example is given below

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';

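I used preg_match to do it.

In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from Wordpress, and I was trying to get the src attribute so I could run it through timthumb.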

// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);

// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);

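To grab the title or the alt instead, you can simply use $pattern = '/title="([^"]*)"/'; for the title or $pattern = '/alt="([^"]*)"/'; for the alt. Sadly, my regex skills aren't good enough to grab all three (alt / title / src) in one pass.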
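One way to grab all three in a single pass is sketched here with preg_match_all and array_combine. The sample $image string is an assumption standing in for the markup returned by get_the_post_thumbnail(), and only double-quoted attributes are handled:

```php
<?php
// Stand-in for the markup returned by get_the_post_thumbnail() (assumed sample).
$image = '<img width="150" height="150" src="/uploads/photo.jpg" alt="A photo" title="My photo" />';

// Group 1 captures the attribute name, group 2 its value, in whatever order they appear.
// The leading \s keeps names such as data-alt from matching.
preg_match_all('/\s(alt|title|src)="([^"]*)"/i', $image, $matches);

// Pair each captured name with its value: array('src' => ..., 'alt' => ..., 'title' => ...).
$attrs = array_combine(array_map('strtolower', $matches[1]), $matches[2]);

// Fall back to an empty string when an attribute is missing.
$src   = isset($attrs['src'])   ? $attrs['src']   : '';
$alt   = isset($attrs['alt'])   ? $attrs['alt']   : '';
$title = isset($attrs['title']) ? $attrs['title'] : '';

echo $src, ' | ', $alt, ' | ', $title, "\n";
```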

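Here's a PHP function I hobbled together from all of the above info for a similar purpose, namely adjusting the width and height attributes of image tags on the fly... a bit clunky, perhaps, but it seems to work dependably: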

function ReSizeImagesInHTML($HTMLContent,$MaximumWidth,$MaximumHeight) {

// find image tags
preg_match_all('/<img[^>]+>/i',$HTMLContent, $rawimagearray,PREG_SET_ORDER); 

// put image tags in a simpler array
$imagearray = array();
for ($i = 0; $i < count($rawimagearray); $i++) {
    array_push($imagearray, $rawimagearray[$i][0]);
}

// put image attributes in another array
$imageinfo = array();
foreach($imagearray as $img_tag) {

    preg_match_all('/(src|width|height)=("[^"]*")/i',$img_tag, $imageinfo[$img_tag]);
}

// combine everything into one array
$AllImageInfo = array();
foreach($imagearray as $img_tag) {

    $ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
    $OrignialWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
    $OrignialHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);

    $NewWidth = $OrignialWidth; 
    $NewHeight = $OrignialHeight;
    $AdjustDimensions = "F";

    if($OrignialWidth > $MaximumWidth) { 
        $diff = $OrignialWidth-$MaximumWidth; 
        $percnt_reduced = (($diff/$OrignialWidth)*100); 
        $NewHeight = floor($OrignialHeight-(($percnt_reduced*$OrignialHeight)/100)); 
        $NewWidth = floor($OrignialWidth-$diff); 
        $AdjustDimensions = "T";
    }

    if($OrignialHeight > $MaximumHeight) { 
        $diff = $OrignialHeight-$MaximumHeight; 
        $percnt_reduced = (($diff/$OrignialHeight)*100); 
        $NewWidth = floor($OrignialWidth-(($percnt_reduced*$OrignialWidth)/100)); 
        $NewHeight= floor($OrignialHeight-$diff); 
        $AdjustDimensions = "T";
    } 

    $thisImageInfo = array('OriginalImageTag' => $img_tag , 'ImageSource' => $ImageSource , 'OrignialWidth' => $OrignialWidth , 'OrignialHeight' => $OrignialHeight , 'NewWidth' => $NewWidth , 'NewHeight' => $NewHeight, 'AdjustDimensions' => $AdjustDimensions);
    array_push($AllImageInfo, $thisImageInfo);
}

// build array of before and after tags
$ImageBeforeAndAfter = array();
for ($i = 0; $i < count($AllImageInfo); $i++) {

    if($AllImageInfo[$i]['AdjustDimensions'] == "T") {
        $NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OrignialWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
        $NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OrignialHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);

        $thisImageBeforeAndAfter = array('OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'] , 'NewImageTag' => $NewImageTag);
        array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
    }
}

// execute search and replace
for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
    $HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'],$ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
}

return $HTMLContent;

}
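The proportional scaling at the core of this function can also be written as a single ratio computation. The sketch below is a different formulation from the function above, not a drop-in replacement: fitWithin() is a hypothetical helper name, it uses round() rather than floor(), and it assumes positive dimensions:

```php
<?php
// fitWithin() is a hypothetical helper name, not part of the function above.
// Assumes $width and $height are positive numbers.
function fitWithin($width, $height, $maxWidth, $maxHeight)
{
    // Pick the smallest scale factor so both sides fit; never enlarge (cap at 1).
    $ratio = min(1, $maxWidth / $width, $maxHeight / $height);

    return array((int) round($width * $ratio), (int) round($height * $ratio));
}

list($w, $h) = fitWithin(800, 600, 400, 400);
echo "$w x $h\n";
```

Because a single ratio is applied to both sides, the aspect ratio is preserved even when both the width and the height exceed their limits.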

Here's THE solution, in PHP:

Just download QueryPath, and then do as follows:

$doc= qp($myHtmlDoc);

foreach($doc->xpath('//img') as $img) {

   $src= $img->attr('src');
   $title= $img->attr('title');
   $alt= $img->attr('alt');

}

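That's it, you're done!

I have read the many comments on this page complaining that using a DOM parser is unnecessary overhead. Well, it may be more expensive than a mere regex call, but the OP has stated that there is no control over the order of the attributes in the img tags. This fact leads to unnecessary regex pattern convolution. Beyond that, using a DOM parser provides the additional benefits of readability, maintainability, and DOM-awareness (regex is not DOM-aware).

I love regex and I answer lots of regex questions, but when dealing with valid HTML there is seldom a good reason to use regex over a parser.

In the demonstration below, see how easily and cleanly DOMDocument handles img tag attributes in any order, with a mixture of quoting styles (and no quoting at all). Also notice that tags without a targeted attribute are not disruptive at all; an empty string is provided as the value.

Code: (Demo)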

$test = <<<HTML
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
<img src='/image/pricklycactus.jpg' title='Roger the cactus' alt='a big green prickly cactus' />
<p>This is irrelevant text.</p>
<img alt="an annoying white cockatoo" title="Polly the cockatoo" src="/image/noisycockatoo.jpg">
<img title=something src=somethingelse>
HTML;

libxml_use_internal_errors(true);  // silences/forgives complaints from the parser (remove to see what is generated)
$dom = new DOMDocument();
$dom->loadHTML($test);
foreach ($dom->getElementsByTagName('img') as $i => $img) {
    echo "IMG#{$i}:\n";
    echo "\tsrc = " , $img->getAttribute('src') , "\n";
    echo "\ttitle = " , $img->getAttribute('title') , "\n";
    echo "\talt = " , $img->getAttribute('alt') , "\n";
    echo "---\n";
}

Output:

IMG#0:
    src = /image/fluffybunny.jpg
    title = Harvey the bunny
    alt = a cute little fluffy bunny
---
IMG#1:
    src = /image/pricklycactus.jpg
    title = Roger the cactus
    alt = a big green prickly cactus
---
IMG#2:
    src = /image/noisycockatoo.jpg
    title = Polly the cockatoo
    alt = an annoying white cockatoo
---
IMG#3:
    src = somethingelse
    title = something
    alt = 
---

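Using this technique in professional code will leave you with a clean script, fewer hiccups to contend with, and fewer colleagues who wish you worked somewhere else.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow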