Question

I'm parsing image links from external webpages in my PHP script. This is my pattern:

$pattern = '/<img[^<>]+?src=["\']([^<>]+?)["\']/';

I found tags like this in some pages:

<img class="avatar-32" src="<%= avatar %>" />

That's why I use [^<>]. But I don't know how to catch other potentially malformed tags.

So I'd like to know how to refine my pattern so that it accepts only valid img tags.

There are questions like:

  1. Can there be spaces between src, = and the opening quote?
  2. Between < and img?
  3. Even newlines?
  4. What if I find a ' inside the src attribute?

In fact, how do browsers parse these links?

Note: I don't match on file extensions because the links can look like:

http://www.example.com/img.jpg?1234
http://www.example.com/img.php
http://www.example.com/img/

I also have a relative-to-absolute link converter, so the conversion is not the problem.


Solution

You'd better use DOMDocument instead of a regex. It has many useful functions to find links, read text content, manipulate the DOM and more.

For example to get the urls of images:

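$dom = new DOMDocument;
// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely well-formed
libxml_use_internal_errors(true);
$dom->loadHTML($response); // I assume $response holds the HTML of the page you fetched (e.g. with cURL)
libxml_clear_errors();

foreach ($dom->getElementsByTagName('img') as $node) {
    if ($node->hasAttribute('src')) {
        $url = $node->getAttribute('src');
        // You can also run a regex here to validate the URL
        // and skip values like "<%= avatar %>"
        echo $url, '<br>';
    }
}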
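To make the validation step in the comment concrete, here is a minimal sketch. The function name extractImageUrls and the filtering rule (reject empty values and anything containing whitespace or angle brackets) are assumptions for illustration, not part of DOMDocument; adjust the rule to whatever you consider a valid URL.

```php
<?php
// Hypothetical helper: returns the src values that look like usable URLs.
function extractImageUrls(string $html): array
{
    $dom = new DOMDocument;
    // Silence warnings on sloppy HTML, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($dom->getElementsByTagName('img') as $node) {
        // getAttribute() returns '' when the attribute is missing.
        $src = trim($node->getAttribute('src'));
        // Assumed rule: skip empty values and anything containing
        // whitespace or angle brackets, e.g. "<%= avatar %>".
        if ($src === '' || preg_match('/[<>\s]/', $src)) {
            continue;
        }
        $urls[] = $src;
    }
    return $urls;
}

// Sample input covering the cases from the question: a template
// placeholder, spaces around "=", newlines inside the tag, single quotes,
// and an apostrophe inside a double-quoted attribute.
$html = <<<'HTML'
<html><body>
<img class="avatar-32" src="<%= avatar %>" />
<img src = "http://www.example.com/img.jpg?1234">
<img
   alt="it's fine"
   src='/img/'>
</body></html>
HTML;

print_r(extractImageUrls($html));
```

The parser deals with the whitespace, newline and quoting variants itself, so the only thing left for a regex is deciding which attribute values to keep.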

These properties and methods can also be very useful:

$node->nodeValue  // The text content of the node
$node->childNodes // The children of the node, as a DOMNodeList
                  // (the same type getElementsByTagName('img') returns)
$node->nodeType   // Some of the nodes in childNodes are text nodes,
                  // so skip them with a conditional:
                  // if ($node->nodeType === XML_ELEMENT_NODE) { /* element node */ }

$nodes->length    // Number of nodes in a DOMNodeList
$nodes->item(1)   // The 2nd node of a DOMNodeList (indexes start at 0)
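
As a rough illustration of how these fit together, the sketch below walks the children of a div, skips the text nodes, and reads a node list by index. The sample markup and variable names are made up for the example.

```php
<?php
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML('<div id="box"><p>First</p> text <p>Second <b>bold</b></p></div>');
libxml_clear_errors();

// First (and only) div in the document
$div = $dom->getElementsByTagName('div')->item(0);

$texts = [];
foreach ($div->childNodes as $child) {
    // The bare " text " between the paragraphs is a text node; skip it
    if ($child->nodeType !== XML_ELEMENT_NODE) {
        continue;
    }
    // nodeValue includes the text of nested elements such as <b>
    $texts[] = $child->nodeName . ': ' . $child->nodeValue;
}

$paragraphs = $dom->getElementsByTagName('p');
echo $paragraphs->length, "\n";           // number of <p> elements
echo $paragraphs->item(1)->nodeValue, "\n"; // text of the 2nd <p>
print_r($texts);
```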
Licensed under: CC-BY-SA with attribution