A warning first: regex should not be used to parse HTML, because HTML is not a regular language. Instead, use a DOM parser such as DOMDocument. That said, I will still answer your question.
The problem is that (.*) is "greedy", not "lazy". Quantifiers are greedy by default, meaning they match as much as they can. In this case .* matches zero or more characters: it first runs all the way to the end of the string (or of the line, since . does not match newlines by default) and then "backtracks" until the next part of your expression, (<h2|<div), can match. The result is the longest possible match. If we make this capture group lazy, (.*?), it instead matches as few characters as possible, expanding one character at a time until the next part of your expression matches. This means it won't go to the end and backtrack.
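To make the difference concrete, here is a minimal sketch using Python's re module, whose greedy and lazy quantifiers behave the same way for this case. The sample HTML is hypothetical, since your actual input isn't shown, and the pattern is simplified to a single capture group.

```python
import re

# Hypothetical sample input, for illustration only.
html = "<h2>First</h2><p>one</p><h2>Second</h2><p>two</p><div>end</div>"

# Greedy: .* runs to the end, then backtracks to the LAST </h2>.
greedy = re.search(r"<h2>(.*)</h2>", html)

# Lazy: .*? expands one character at a time and stops at the FIRST </h2>.
lazy = re.search(r"<h2>(.*?)</h2>", html)

print(greedy.group(1))  # First</h2><p>one</p><h2>Second
print(lazy.group(1))    # First
```

The greedy version swallows everything between the first opening tag and the last closing tag, which is the behavior described above.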
I also made some modifications to the overall expression:
<h2>(.*?)</h2>(.*?)(?=<\w+>)
First, I made both capture groups lazy, for the reasons above. Then I used a "lookahead", (?=...), so that the following tag is checked but not consumed, which leaves it available as the start of the next match. Finally, I used <\w+> instead of <h2|<div. This is more flexible, since \w matches [a-zA-Z0-9_], so <\w+> matches any opening tag that has no attributes.
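As a rough check of the full expression, the sketch below runs it over a hypothetical input with Python's re module. The variable names and the sample HTML are assumptions for illustration, not your real data.

```python
import re

# The expression from this answer, unchanged.
pattern = re.compile(r"<h2>(.*?)</h2>(.*?)(?=<\w+>)")

# Hypothetical input: plain text after each heading, then another tag.
html = "<h2>First</h2>intro text<h2>Second</h2>more text<div>footer</div>"

# findall returns one (heading, following text) tuple per match.
sections = pattern.findall(html)

print(sections)  # [('First', 'intro text'), ('Second', 'more text')]
```

Because the lookahead does not consume the next tag, the second match can begin at the second <h2>. Keep in mind that the second group stops at the first attribute-less opening tag it meets, so if the text after the heading begins with a tag such as <p>, that group will be empty. This is one more reason to prefer a DOM parser.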