Question

I'm trying to understand preg_match_all in PHP. A friend and I run a small site for fun with a few friends, mostly to practice coding, and a while back we added a section that contains code to strip the images out of any page source:

$html = file_get_contents('http://www.anyrandomwebsite.com');
preg_match_all('/<img[^>]+>/i',$html, $result);

We pretty much just found this online and couldn't make much sense of it, but I understand that it finds every image tag on a page and puts them into an array.

Now I'm trying to write code that searches a page's source for any links (so anything starting with 'http'), preferably only those that end in a specific extension (e.g. .net or .zip).

But I can't figure out how to write the pattern. I've tried learning regex, but based on what my friends have told me, the code used to find image tags doesn't follow normal rules, and they don't fully understand it either.

Basically, I am looking for someone to please write a preg_match_all call that can find links on a page, explain WHY it works, and also explain how the above code works (preferably character by character in the pattern part).

Thank you very much to anyone who responds to this!

Solution

To explain the regex you have:

/      # Starting regex delimiter
<img   # Match <img
[^>]+  # Match one or more characters that aren't a >
>      # Match a >
/      # Ending regex delimiter
i      # Case-insensitive option
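
As a rough sketch, the annotated breakdown above can be handed to PHP almost verbatim by adding the x modifier, which makes PCRE ignore whitespace in the pattern and treat # as the start of a comment. The sample markup in $html is made up for illustration; it stands in for whatever file_get_contents returns.

```php
<?php
// Hypothetical sample markup, used instead of fetching a real page.
$html = '<p>Hi</p><IMG src="a.png"><img src="b.jpg" alt="b"><img>';

// Same pattern as in the question, spread out with the x modifier.
// Note: the comments inside the pattern must not contain the delimiter.
$pattern = '/
    <img    # the literal text <img
    [^>]+   # one or more characters that are not >
    >       # the closing >
/xi';

// preg_match_all returns the number of full matches it found.
$count = preg_match_all($pattern, $html, $result);

// $result[0] holds every full match, in the order they appear.
print_r($result[0]);
```

With this sample, $count is 2: the uppercase tag is found thanks to the i modifier, and the bare <img> at the end is skipped because [^>]+ needs at least one character before the >.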

How does it work?

Imagine what an img tag looks like. It starts with <img and ends with >. So once we've identified an <img tag, we need to match everything until the nearest >.

That means we need to match as many characters as we can, as long as they are not a >. And that's exactly what [^>]+ does. Since there needs to be at least one of those characters (<img> is not legal), we use a + instead of the "zero or more" *.
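
The same "anything except the terminator" idea can be reused for the links asked about in the question. The following is only a sketch under some assumptions: a link is taken to be any run of text starting with http:// or https://, it ends at the first quote, whitespace, or angle bracket, and the extensions .net and .zip are simply the ones mentioned in the question. The sample markup and the name $linkPattern are made up for illustration.

```php
<?php
// Hypothetical sample markup with four links.
$html = '<a href="http://example.net">site</a> '
      . '<a href="https://example.com/files/archive.zip">zip</a> '
      . '<a href="http://example.com/page.html">page</a> '
      . '<a href="/local/file.zip">relative</a>';

// #...# is used as the delimiter so the slashes in http:// need no escaping.
//   https?://            "http://" or "https://"
//   [^\s"'<>]+           one or more characters that cannot be part of the URL here
//   \.(?:net|zip)        a literal dot followed by one of the wanted extensions
//   (?=["'\s<>]|$)       lookahead: the URL must end right after the extension
$linkPattern = '#https?://[^\s"\'<>]+\.(?:net|zip)(?=["\'\s<>]|$)#i';

$found = preg_match_all($linkPattern, $html, $links);

// $links[0] holds the matched URLs.
print_r($links[0]);
```

Here only http://example.net and https://example.com/files/archive.zip are picked up. The .html page has the wrong extension, and the relative link does not start with http. Note that this scans the raw text for URLs rather than looking at href attributes specifically.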

You might see a problem here: what if the tag does contain a > somewhere, e.g. inside an attribute value? The pattern would stop at that > and cut the tag short. And there you have one of the reasons why using regexes to parse HTML is fraught with peril.
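
To make that failure concrete, here is a small sketch with a made-up tag whose alt attribute contains a >. It also shows one possible workaround, a pattern that treats quoted attribute values as single units; $quoteAware is a name chosen for this example, and the pattern is not a complete fix for every kind of HTML.

```php
<?php
// Hypothetical tag with a > inside a quoted attribute value.
$html = '<img alt="a > b" src="x.png">';

// The original pattern stops at the first > it sees.
preg_match_all('/<img[^>]+>/i', $html, $naive);

// A quote-aware variant: each repetition consumes either
//   [^>"']    a single character that is not >, " or '
//   "[^"]*"   a complete double-quoted value
//   '[^']*'   a complete single-quoted value
$quoteAware = '/<img(?:[^>"\']|"[^"]*"|\'[^\']*\')+>/i';
preg_match_all($quoteAware, $html, $better);

print_r($naive[0]);   // the truncated match
print_r($better[0]);  // the whole tag
```

The first pattern returns only <img alt="a >, while the second returns the whole tag. The second pattern is already noticeably harder to read, which is why a real HTML parser is usually the safer choice once the input is not under your control.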

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow