Question

I need to insert <p> tags around the content of each list element in an HTML fragment. This must not create nested paragraphs, which is why I want to use lookahead/lookbehind assertions to detect whether the content is already enclosed in a paragraph tag.

So far, I've come up with the following code.

This example uses a negative lookbehind assertion to match each </li> closing tag that is not preceded by a </p> closing tag and arbitrary whitespace:

$html = <<<EOF
<ul>
        <li>foo</li>
        <li><p>fooooo</p></li>
        <li class="bar"><p class="xy">fooooo</p></li>
        <li>   <p>   fooooo   </p>   </li>
</ul>
EOF;
$html = preg_replace('@(<li[^>]*>)(?!\s*<p)@i', '\1<p>', $html);
$html = preg_replace("@(?<!</p>)(\s*</li>)@i", '</p>\1', $html);
echo $html, PHP_EOL;

which to my surprise results in the following output:

<ul>
    <li><p>foo</p></li>
    <li><p>fooooo</p></li>
    <li class="bar"><p class="xy">fooooo</p></li>
    <li>   <p>   fooooo   </p> </p>  </li>
</ul>

The insertion of the beginning tag works as expected, but note the additional </p> tag inserted in the last list element!

Can somebody explain why the whitespace (\s*) seems to be completely ignored in the regex when a negative lookbehind assertion is used?

And even more important: what else can I try to achieve this goal?

Solution

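The whitespace isn't ignored. Because the regex is not anchored in any way, \s* is free to match fewer whitespace characters than are actually there, so the match can start in the middle of the run of spaces.

In this case, let's look at how your string can be broken down. The square brackets indicate the attempted match.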

... </p>[   </li>] // Fails, lookbehind assertion denies match
... </p> [  </li>] // Succeeds, lookbehind sees a space, not </p>

So you see the match succeeds simply by matching one less space, which is why you see a space between the two </p> in the result.
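
To see this for yourself, here is a minimal sketch that runs the pattern from the question against just the problematic tail of the last list item and asks preg_match() where the match starts. The variable names $fragment and $m are only for illustration.

```php
<?php
// The tail of the last list item: </p>, three spaces, </li>.
$fragment = '</p>   </li>';

// Same pattern as in the question; PREG_OFFSET_CAPTURE also reports
// the byte offset at which each group matched.
preg_match('@(?<!</p>)(\s*</li>)@i', $fragment, $m, PREG_OFFSET_CAPTURE);

var_dump($m[1][0]); // string(7) "  </li>"  (only two of the three spaces)
var_dump($m[1][1]); // int(5)              (one character past the end of </p>)
```

Offset 4 is rejected by the lookbehind, so the engine simply moves on to offset 5, where the text behind it is a space and the assertion passes.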

There's no easy fix for this in Regex. THE PONY HE COMES. So instead try using a parser.

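// Parse the fragment; loadHTML() wraps it in <html><body>...</body></html>.
$dom = new DOMDocument();
$dom->loadHTML($html);
$lis = $dom->getElementsByTagName('li');
foreach($lis as $li) {
    // Only wrap list items that contain no <p> anywhere inside them.
    if( !$li->getElementsByTagName('p')->length) {
        $p = $dom->createElement("p");
        // appendChild() moves each node, so this loop empties the <li> into the new <p>.
        while($li->firstChild) $p->appendChild($li->firstChild);
        $li->appendChild($p);
    }
}
// Serialize only the <body> element, then cut off its opening and closing tags.
$output = $dom->saveHTML($dom->getElementsByTagName('body')->item(0));
$output = substr($output,strlen("<body>"),-strlen("</body>")); // strip body tag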

OTHER TIPS

You have this:

</p>   </li>

And your regex doesn't match here:

</p>   </li>
    ^

because there's a </p> immediately preceding. But it DOES match here:

</p>   </li>
     ^

because the preceding text is not </p>, but a space.

You want an HTML parser. PHP comes with several, but I'm not much of a PHP dev so I can't recommend any in particular. See this question for some recommendations.

This might help.

$html = preg_replace('@(<li[^>]*>)([^</li>]+)(?!\s*<p)@i', '$1<p>$2</p>', $html);
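
A word of caution on the pattern above: [^</li>] is a character class, so it excludes the single characters <, /, l, i and >, not the string </li>. It happens to work on the sample, but an item such as <li>foo list</li> would be cut off at the first "l".

If you want to stay with the two-step approach from the question, one possible sketch is to add \s as a second alternative in the lookbehind, so the match can no longer start in the middle of the whitespace. The function name wrapListItems is made up for this example. It assumes every <li> is either completely wrapped in one <p> or not wrapped at all; nested lists or mixed content will still break it, which is where a parser is the safer choice.

```php
<?php
// Hypothetical helper, not part of the original question or answers.
function wrapListItems(string $html): string
{
    // Step 1 (unchanged from the question): add <p> after every <li ...>
    // that is not already followed by optional whitespace and "<p".
    $html = preg_replace('@(<li[^>]*>)(?!\s*<p)@i', '\1<p>', $html);

    // Step 2: the match may start neither right after </p> nor right after
    // a whitespace character, so it cannot slip past the lookbehind by
    // skipping one of the spaces.
    return preg_replace('@(?<!</p>|\s)(\s*</li>)@i', '</p>\1', $html);
}

$html = <<<EOF
<ul>
        <li>foo</li>
        <li><p>fooooo</p></li>
        <li class="bar"><p class="xy">fooooo</p></li>
        <li>   <p>   fooooo   </p>   </li>
</ul>
EOF;

echo wrapListItems($html), PHP_EOL;
```

With the sample input, only the first item changes (to <li><p>foo</p></li>); the other three are left as they are.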
Licensed under: CC-BY-SA with attribution