php pcre - how to manage position?

https://stackoverflow.com/questions/22991124

PHP
pcre

01-07-2023

Question

<?php
$s = "<h2>1</h2>unstructed string1<h2>2</h2>unstructed string2<div>some data</div>";

preg_match_all("`<h2>(.*)</h2>(.*)(<h2|<div)`isU", $s, $m);

var_export($m);

?>

The result is:
array (
  0 => 
  array (
    0 => '<h2>1</h2>unstructed string1<h2',
  ),
  1 => 
  array (
    0 => '1',
  ),
  2 => 
  array (
    0 => 'unstructed string1',
  ),
  3 => 
  array (
    0 => '<h2',
  ),
)

My aim is to capture the content of each h2 tag and the data that follows it. Only the data for the first h2 is found.

The data after an h2 tag sometimes ends at a div tag and sometimes at another h2 tag.

As I understand it, in the second case the internal PCRE position moves inside the h2 tag, and therefore PHP does not find the second h2 tag.

Solution

A word of warning: regex is a poor fit for parsing HTML, because HTML is not a regular language. Use a DOM parser such as DOMDocument instead. That said, I will still answer your question.
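Before the regex answer, here is a minimal sketch of what the DOMDocument route could look like for this input. It assumes that the text belonging to a heading sits in the text nodes directly after the h2 element, as in the sample string. The wrapper div with id "root" and the $sections array are illustrative names, not part of the question.

```php
<?php
$s = "<h2>1</h2>unstructed string1<h2>2</h2>unstructed string2<div>some data</div>";

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Assumption: wrapping the fragment in one container element keeps the
// h2 elements and the loose text as siblings under a known parent.
$doc->loadHTML('<div id="root">' . $s . '</div>');
libxml_clear_errors();

$sections = [];
foreach ($doc->getElementsByTagName('h2') as $h2) {
    $body = '';
    // Walk the following siblings and stop at the first non-text node
    // (the next h2 or the div in the sample string).
    for ($node = $h2->nextSibling;
         $node !== null && $node->nodeType === XML_TEXT_NODE;
         $node = $node->nextSibling) {
        $body .= $node->nodeValue;
    }
    $sections[] = ['title' => $h2->textContent, 'body' => $body];
}

var_export($sections);
```

Unlike a pattern, this does not depend on which tag happens to end the section, but it only collects plain text; inline markup after the heading would need a different stop condition.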

Regular expression quantifiers are "greedy" by default: .* matches as many characters as it can, running all the way to the end of the string and then "backtracking" until the rest of the expression can match. A "lazy" quantifier (.*?) does the opposite and grows one character at a time until the next part of the expression matches. Your pattern uses the U modifier, which already makes (.*) lazy, so greediness is not the cause here. The cause is the one you suspected: the last group (<h2|<div) consumes the opening <h2 of the second heading, so the next search starts after it and can no longer find <h2>.

I made some modifications to the overall expression:

<h2>(.*?)</h2>(.*?)(?=<\w+>)

First, I wrote both capture groups as explicitly lazy (.*?). This means the U modifier has to be dropped, because U inverts greediness and would turn .*? back into a greedy match. Then I used a "lookahead" so that the terminating tag is only checked, not consumed, which leaves the next <h2> available for the following match. Finally, I used <\w+> instead of <h2|<div. This is more flexible (\w matches [a-zA-Z0-9_]), although it only matches opening tags without attributes.
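As a minimal sketch, this is how the expression could be applied to the string from the question. The modifiers are is rather than isU, and the PREG_SET_ORDER flag and the $sections array are my own additions for readability, not something the question requires.

```php
<?php
$s = "<h2>1</h2>unstructed string1<h2>2</h2>unstructed string2<div>some data</div>";

// No U modifier: the quantifiers are already written as lazy (.*?).
// The lookahead (?=<\w+>) checks for the next opening tag without
// consuming it, so the following <h2> can start the next match.
preg_match_all('`<h2>(.*?)</h2>(.*?)(?=<\w+>)`is', $s, $m, PREG_SET_ORDER);

$sections = [];
foreach ($m as $match) {
    // $match[1] is the heading text, $match[2] is the data after it.
    $sections[] = ['title' => $match[1], 'body' => $match[2]];
}

var_export($sections);
```

With the sample string this yields two entries, one per h2: '1' with 'unstructed string1' and '2' with 'unstructed string2'.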

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow