Question

I've run into this problem several times before when trying to do some HTML scraping with PHP and the preg_* functions.

Most of the time I have to capture structures like this:

<!-- comment -->
<tag1>lorem ipsum</tag1>

<p>just more text with several html tags in it, sometimes CDATA encapsulated…</p>
<!-- /comment -->

In particular I want something like this:

/<tag1>(.*?)<\/tag1>\n\n<p>(.*?)<\/p>/mi

but the \n\n doesn't look like it would work.

Is there a general line-break switch?


Solution

I think you could replace the \n\n with (\r?\n){2}. That way you match CRLF pairs as well as bare LF characters.
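As a rough illustration of that suggestion, here is a minimal sketch in PHP. The sample string is made up, and two details go beyond the answer and are my assumptions about what is wanted: the group is written as non-capturing, (?:...), so the two text captures stay at indexes 1 and 2, and the s flag is used so that . can also match line breaks inside the captured text.

```php
<?php
// Sample input with Windows-style (CRLF) line endings; made up for illustration.
$html = "<tag1>lorem ipsum</tag1>\r\n\r\n<p>just more text</p>";

// (?:\r?\n){2} matches two line breaks, each either LF or CRLF.
// s lets . match newlines; i makes the tag names case-insensitive.
$pattern = '/<tag1>(.*?)<\/tag1>(?:\r?\n){2}<p>(.*?)<\/p>/si';

$matched = preg_match($pattern, $html, $m);
if ($matched === 1) {
    echo $m[1], "\n"; // content of <tag1>
    echo $m[2], "\n"; // content of <p>
}
```

Note that the m flag from the question only changes where ^ and $ match; it does not affect \n or the dot. PCRE also offers \R, which matches any newline sequence, so \R{2} may be a shorter alternative if the input mixes line-ending styles.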

OTHER TIPS

Are you sure you want to parse HTML using regexps? HTML isn't a regular language, and there are too many corner cases.

I would investigate some form of HTML parser (perhaps this one?) and then identify the pattern you're interested in via the returned HTML data structure.

Or you could look at the DOM extension for PHP. It has functions to load HTML from a string or a file. You can then use the PHP DOM methods to traverse the DOM and find the data you are interested in.
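A minimal sketch of the DOM approach might look like the following. The markup is invented for illustration, and it uses an h2 element in place of the question's tag1 placeholder, since the real tag names are not given. The XPath expression is also just one possible way to express "the paragraph that follows this element".

```php
<?php
// Sample markup, made up for illustration; <h2> stands in for the question's <tag1>.
$html = "<div><h2>lorem ipsum</h2>\r\n\r\n<p>just more <b>text</b></p></div>";

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$results = [];
foreach ($xpath->query('//h2') as $heading) {
    // First <p> sibling after the heading, regardless of the whitespace in between.
    $para = $xpath->query('following-sibling::p[1]', $heading)->item(0);
    if ($para !== null) {
        $results[] = [$heading->textContent, $para->textContent];
    }
}
print_r($results);
```

Because the parser treats the line breaks between the elements as ordinary whitespace text nodes, the LF versus CRLF question does not come up at all here.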

Licensed under: CC-BY-SA with attribution