Question

I have some HTML content that I need to parse to get all the images. I then want to print out the whole content, but call a method on a PHP class instance at every occurrence of an image.

This is the content

<?php $content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">'; ?>

I need to be able to get the images and run a class method with the output.

So the result would be something like

<?php echo 'Some text
<p>A paragraph</p>';

$this->Image('image1.jpg', PDF_MARGIN_LEFT, $y_offset, 116, 85);
echo 'More text';
$this->Image('image2.jpg', PDF_MARGIN_LEFT, $y_offset, 116, 85);

But obviously I imagine it would have to be a loop or something that does it automatically.

Solution

To convert the entire HTML snippet to TCPDF as you mentioned in your comment, you'll need to parse the snippet with DOMDocument and loop through each child node, deciding how to handle each one appropriately.

The catch with the snippet you've provided above is that it isn't a complete HTML document, thus DOMDocument will wrap it in <html> and <body> tags when parsing it, loading the following structure internally:

<html>
    <body>
        Some text
        <p>A paragraph</p>
        <img src="image1.jpg" width="200" height="200">
        More text
        <img src="image2.jpg" width="200" height="200">
    </body>
</html>

This caveat is easily worked around, however, by building on @hakre's answer in the thread I linked to below. My recommendation would be something along the lines of the following:

// Load the snippet into a DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($content);

// Use DOMXPath to retrieve the body content of the snippet
$xpath = new DOMXPath($doc);
$data = $xpath->evaluate('//html/body');

// <body> is the first item in $data, so for readability we do this
// (item(0) also works on PHP versions where DOMNodeList lacks array access)
$body = $data->item(0);

// Now we loop through the elements in your original snippet
foreach ($body->childNodes as $node) {
    switch ($node->nodeName) {
        case 'img':
            // Get the value of the src attribute from the img element
            $src = $node->attributes->getNamedItem('src')->nodeValue;
            $this->Image($src, PDF_MARGIN_LEFT, $y_offset, 116, 85);
            break;
        default:
            // Pass the line to TCPDF as a normal paragraph
            break;
    }
}

This way, you can easily add additional case 'blah': blocks to handle other elements which may appear in your $content snippets and handle them appropriately, and the content will be processed in the correct order without breaking the original flow of the text. :)
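As a rough, self-contained sketch of the loop above, the following fills in the default case and can be run without TCPDF. The SegmentRecorder class and its text() and render() methods are hypothetical stand-ins invented for illustration: they only record the calls, whereas real code would call the PDF methods from the question at those points.

```php
<?php
// Hypothetical stand-in for the PDF class in the question: it only records
// the calls it receives so the processing order can be inspected.
class SegmentRecorder
{
    public $calls = array();

    public function Image($src)
    {
        $this->calls[] = array('image', $src);
    }

    public function text($text)
    {
        $this->calls[] = array('text', $text);
    }

    public function render($content)
    {
        libxml_use_internal_errors(true); // keep parser warnings quiet
        $doc = new DOMDocument();
        $doc->loadHTML($content);
        libxml_clear_errors();

        $body = $doc->getElementsByTagName('body')->item(0);
        foreach ($body->childNodes as $node) {
            if ($node->nodeName === 'img') {
                $this->Image($node->getAttribute('src'));
                continue;
            }
            // Anything else: treat its text as a paragraph, skipping
            // whitespace-only nodes that sit between elements.
            $text = trim($node->textContent);
            if ($text !== '') {
                $this->text($text);
            }
        }
    }
}

$content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">';

$recorder = new SegmentRecorder();
$recorder->render($content);
print_r($recorder->calls);
```

Note that this only walks the direct children of <body>, as the loop above does; an image nested inside another element would need a recursive walk.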

-- Original answer. Will work if you just want to extract the image sources and process them elsewhere independently of the rest of the content.

You can match all the <img> tags in your $content string by using a regular expression:

/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i

A live breakdown of the regex which you can play with to see how it works is here: http://regex101.com/r/tS5xY9

You can use this regex with preg_match_all() to retrieve all of the image tags from within your $content variable as follows:

$matches = array();
$num = preg_match_all('/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i', $content, $matches, PREG_SET_ORDER);

The PREG_SET_ORDER flag tells preg_match_all() to group its results by match rather than by capture group, which is easier to loop through when producing output: each element of the array (i.e., $matches[0], $matches[1], etc.) holds the full match and the captured groups for one <img> tag. In the case of the regex above, $matches[0] will contain the following:

array(
    0 => '<img src="image1.jpg" width="200" height="200">',
    1 => 'image1.jpg',
)

You can now loop through $matches as $key => $match and pass $match[1] to your $this->Image() method.
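A minimal sketch of that loop, using the $content from the question. To keep it runnable on its own it collects the src values into a $sources array (a name chosen here for illustration); inside your class you would call $this->Image() with $match[1] at that point instead.

```php
<?php
$content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">';

$pattern = '/<img(?:[\s\w="]+)src="([^"]+)"(?:[\s\w="]*)\/?>/i';

$matches = array();
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$sources = array();
foreach ($matches as $match) {
    // $match[0] is the whole <img> tag, $match[1] the captured src value
    $sources[] = $match[1];
}

print_r($sources);
```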

Alternatively, if you don't want to loop through, you can just access each src attribute directly from $matches as $matches[0][1], $matches[1][1], etc.

If you need to be able to access the other attributes within the tags, then I recommend using the DOMDocument method provided by @hakre on Get img src with PHP. If you just need to access the src attribute, then using preg_match_all() is faster and more efficient as it does not need to load the entire DOM of the snippet into memory as objects to provide you with the data you need.
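For comparison, here is a hedged sketch of the DOMDocument route when the other attributes are needed. It is not necessarily the code from the linked answer, just one way to do it with getElementsByTagName() and getAttribute(); the $images array and its keys are names chosen for illustration.

```php
<?php
$content = 'Some text
<p>A paragraph</p>
<img src="image1.jpg" width="200" height="200">
More text
<img src="image2.jpg" width="200" height="200">';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($content);
libxml_clear_errors();

$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    // getAttribute() returns an empty string when the attribute is absent,
    // so a missing width or height becomes 0 after the cast.
    $images[] = array(
        'src'    => $img->getAttribute('src'),
        'width'  => (int) $img->getAttribute('width'),
        'height' => (int) $img->getAttribute('height'),
    );
}

print_r($images);
```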

Other tips

You could build a lexer or parser to find out where the images are.

You're looking for two tokens at the beginning: <img and the respective closing >. A starting point for this could be something like this:

$text = "hello <img src='//first.jpg'> there <img src='//second.jpg'>";
$pos  = 0;

while (($opening = strpos($text, '<img', $pos)) !== false) {

    // Find the next closing bracket's location
    $closing = strpos($text, '>', $opening);
    if ($closing === false) {
        break; // Unterminated tag: stop instead of looping forever
    }
    $length = ($closing - $opening) + 1; // Add one for the closing '>'

    $img_tag = substr($text, $opening, $length);

    var_dump($img_tag);

    // Move past the closing '>' so the next search starts after this tag
    $pos = $closing + 1;
}

You're going to have to then build methods to scan for the img tags. You can also add your PDF methods in the loop, too.
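One possible way to take that further is sketched below: the same strpos() loop, but it also keeps the text between the tags and pulls the src out of each tag, so the result preserves the original order. The splitOnImages() function and its ('text' | 'image', value) segment format are assumptions made for this example; in your class you would call the PDF methods where the segments are appended.

```php
<?php
// Hypothetical helper: splits $text into ordered segments of the form
// array('text', ...) or array('image', src).
function splitOnImages($text)
{
    $segments = array();
    $pos = 0;

    while (($opening = strpos($text, '<img', $pos)) !== false) {
        $closing = strpos($text, '>', $opening);
        if ($closing === false) {
            break; // unterminated tag: the remainder is kept as plain text
        }

        // Text between the previous tag (or the start) and this tag
        $before = trim(substr($text, $pos, $opening - $pos));
        if ($before !== '') {
            $segments[] = array('text', $before);
        }

        // Accept single or double quotes around the src value
        $tag = substr($text, $opening, $closing - $opening + 1);
        if (preg_match('/\bsrc\s*=\s*(["\'])(.*?)\1/i', $tag, $m)) {
            $segments[] = array('image', $m[2]);
        }

        $pos = $closing + 1;
    }

    // Whatever follows the last tag
    $rest = trim((string) substr($text, $pos));
    if ($rest !== '') {
        $segments[] = array('text', $rest);
    }

    return $segments;
}

$text = "hello <img src='//first.jpg'> there <img src='//second.jpg'>";
print_r(splitOnImages($text));
```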

Another more manageable approach could be to build a class that walks through each character. It'd first look for an opening '<' character, then check if the next three are 'img', and if so proceed to scan for the src, height, width attributes respectively. This is more work but is way more flexible – you'll be able to scan for much more than just your image tags.
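A small sketch of what such a class might look like follows. ImgScanner and all of its methods are hypothetical names made up for this example, and it takes shortcuts: any tag whose name merely starts with "img" is treated as an image, and HTML entities are not decoded.

```php
<?php
// Hypothetical scanner: walks the input one character at a time and
// collects the attributes of every <img> tag it finds.
class ImgScanner
{
    private $text;
    private $len;
    private $pos = 0;

    public function __construct($text)
    {
        $this->text = $text;
        $this->len  = strlen($text);
    }

    // Returns one array of attributes per <img> tag, in document order.
    public function scan()
    {
        $images = array();
        while ($this->pos < $this->len) {
            $isImg = $this->text[$this->pos] === '<'
                && strtolower(substr($this->text, $this->pos + 1, 3)) === 'img';
            if ($isImg) {
                $this->pos += 4; // step over "<img"
                $images[] = $this->readAttributes();
            } else {
                $this->pos++;
            }
        }
        return $images;
    }

    // Consumes name=value pairs up to and including the closing '>'.
    private function readAttributes()
    {
        $attrs = array();
        while ($this->pos < $this->len && $this->text[$this->pos] !== '>') {
            if (!ctype_alpha($this->text[$this->pos])) {
                $this->pos++; // whitespace, '/', or anything unexpected
                continue;
            }
            $name = $this->readWhile(function ($c) {
                return ctype_alnum($c) || $c === '-';
            });
            $value = '';
            if ($this->pos < $this->len && $this->text[$this->pos] === '=') {
                $this->pos++; // step over '='
                $value = $this->readValue();
            }
            $attrs[strtolower($name)] = $value;
        }
        $this->pos++; // step over '>'
        return $attrs;
    }

    // Reads a quoted or unquoted attribute value.
    private function readValue()
    {
        if ($this->pos >= $this->len) {
            return '';
        }
        $quote = $this->text[$this->pos];
        if ($quote === '"' || $quote === "'") {
            $this->pos++; // opening quote
            $value = $this->readWhile(function ($c) use ($quote) {
                return $c !== $quote;
            });
            $this->pos++; // closing quote
            return $value;
        }
        // Unquoted value: runs until whitespace or '>'
        return $this->readWhile(function ($c) {
            return !ctype_space($c) && $c !== '>';
        });
    }

    // Advances while $accept() approves the current character.
    private function readWhile($accept)
    {
        $start = $this->pos;
        while ($this->pos < $this->len && $accept($this->text[$this->pos])) {
            $this->pos++;
        }
        return (string) substr($this->text, $start, $this->pos - $start);
    }
}

$scanner = new ImgScanner('Some text <img src="image1.jpg" width="200" height="200"> more');
print_r($scanner->scan());
```

Because quoted values are consumed as a unit, a '>' inside an attribute value does not end the tag early, which is one thing the strpos() version above cannot handle.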

License: CC-BY-SA with attribution
Not affiliated with StackOverflow