Question

I've been looking through the existing questions and got a better idea of my problem, but I still haven't found an answer.

I have a problem with regular expressions in PHP. I'm trying to get all the text in the "alt" attributes of an HTML file. I'm taking into account all the possible tag names (img, input and area) and all kinds of eventualities, like spaces and line breaks in between the characters (like <img alt = "Hello">). The pattern must also allow the matched string to be enclosed in either single or double quotes and to contain the other kind of quote mark inside, for example: <img alt="Alan's picture"> or <img alt='Example for the word "hello" in the text'>.

This is becoming difficult for me (I'm a beginner with regular expressions), so I'll just show you what I've got. Note that I'm trying to use a backreference inside a character class, which I found to be bad practice (or so I think).

'/<\s*(?:img|input|area)\s[^>]*alt\s*=\s*("|\')([^\1>]*)\1[^>]*>/siU'

I've also seen some people on StackOverflow recommending HTML parsers for things like this, but I'm worried about how many resources that approach may consume. Do you think it is a better idea? Thank you!

Solution

You should absolutely use a parser. There are several reasons for this:

  • An HTML parser library can account for broken (or otherwise malformed) HTML that a regular expression will miss; for instance, some webpages will fail to escape quotes embedded in the alt attribute, such as alt='why can't I do this'
  • Parsers will decode character references (entities) automatically; for instance, alt="why&#32;the&#32;long&#32;space"
  • Additionally, an HTML parser will probably offer speed and API advantages

You can check out the StackOverflow question "Robust, Mature HTML Parser for PHP" for suggestions about which parsers are worth using.
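As a rough illustration of the parser approach, here is a minimal sketch using the DOM extension that ships with PHP (DOMDocument and DOMXPath). The helper name extractAltTexts() and the sample markup are made up for this example, and the code assumes the input is ASCII or declares its own charset, since loadHTML() otherwise falls back to a legacy encoding. Treat it as one possible starting point rather than the recommended library from the linked question.

```php
<?php
// Sketch only: extractAltTexts() is a hypothetical helper, not part of PHP.
// It returns the alt text of every <img>, <input> and <area> element.
function extractAltTexts(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is often malformed, so collect libxml's parse
    // warnings internally instead of letting them be printed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $alts = [];

    // Union of the three tag names, restricted to elements that have alt.
    foreach ($xpath->query('//img[@alt] | //input[@alt] | //area[@alt]') as $node) {
        // getAttribute() returns the value with quotes removed and
        // character references such as &#32; already decoded.
        $alts[] = $node->getAttribute('alt');
    }

    return $alts;
}

// Sample input built from the cases mentioned in the question and answer.
$html = <<<'HTML'
<p>
  <img alt = "Hello">
  <img alt="Alan's picture">
  <img alt='Example for the word "hello" in the text'>
  <input type="image" alt="why&#32;the&#32;long&#32;space">
</p>
HTML;

print_r(extractAltTexts($html));
```

For the sample input, this should produce the four alt strings in document order, with the surrounding quotes stripped and &#32; turned into ordinary spaces. Spacing around the equals sign and the choice of quote character are handled by the parser, so no backreference tricks are needed.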

OTHER TIPS

Using a parser is definitely the way to go.

Regular expressions are highly inappropriate for this type of task, and even Jon Skeet cannot parse HTML using regular expressions.

Licensed under: CC-BY-SA with attribution