Help with a regex that strips out leading white space

https://stackoverflow.com/questions/1250382

12-09-2019

Question

I am modifying a core function of the Kohana library, the text::auto_p() function.

The function describes itself as "nl2br() on steroids". Essentially, it provides <br /> for single line breaks, but text separated by double line breaks is surrounded with <p> tags.

The limitation I have found with it is that it will put <br />s in a <pre> element. This will create double new lines, which isn't what I want. I have made a modification to pick up pre elements with a regex, and a callback that will strip out the <br />s, which works alright.

However, the main problem is that I have code samples in my text that gets auto_p()'d, and I need to preserve the indentation (for readability). Unfortunately for me, the function strips leading and trailing white space on lines.

Here is the regex that strips leading space

$str = preg_replace('~^[ \t]+~m', '', $str);

I'm not the best regex guru, but I'm pretty sure that says "Get leading spaces and tabs where there is at least one and replace them with an empty string."

I have tried removing this line, but then it will add <br />s where I definitely do not want them. In one case, I was getting output like this:

<ul><br />
    <li>something</li>
</ul>

How would I modify this regex or code to not strip leading space inside of a <pre> element?

The original helper function from Kohana is available here (scroll almost to the bottom).

I know I will get a few 'Use a HTML parser' type answers - and while you may be correct - the existing code simply uses regex, and I would prefer a simpler solution (where I don't have to include a library etc).

Thanks for your time.

Solution

Here's how I would do it:

$str = preg_replace(
    '~^[ \t]++(?=(?:[^<]++|<(?!/?+pre\b))*+(?:\z|<pre\b))~im',
    '', $str);

After matching some line-leading whitespace, the lookahead scans ahead for <pre> or </pre> tags. The meat of the lookahead is this bit:

(?:[^<]++|<(?!/?+pre\b))*+

It matches zero or more of anything that's not a left angle bracket, or a left angle bracket if it's not the beginning of a <pre> or </pre> tag. That part will only stop matching when it encounters a <pre> (starting) tag, a </pre> (ending) tag, or the end of the input. If it's an ending tag that stops it, you know you're inside a <pre> element: the final alternation (?:\z|<pre\b) cannot match there, so the lookahead fails and the whitespace is left alone.
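To make the pieces easier to follow, here is a sketch of the same pattern written with the x (extended) modifier, so each part can carry a comment. The variable names and the sample string are only illustrative; the regex itself is meant to be equivalent to the compact one above.

```php
<?php
// Compact pattern, exactly as given in the answer.
$compact = '~^[ \t]++(?=(?:[^<]++|<(?!/?+pre\b))*+(?:\z|<pre\b))~im';

// Same pattern with the x modifier added. Whitespace outside character
// classes is ignored and "#" starts a comment that runs to end of line.
$verbose = '~
    ^[ \t]++               # spaces or tabs at the start of a line
    (?=                    # strip them only if, looking ahead, we see
        (?:
            [^<]++         #   text containing no left angle bracket
          |
            <(?!/?+pre\b)  #   or a left angle bracket that is not a pre tag
        )*+
        (?: \z | <pre\b )  #   followed by end of input or an opening pre tag
    )
~imx';

// Illustrative input: one indented line outside <pre>, one inside.
$sample = "  outside\n<pre>\n  inside\n</pre>\n";

// Both patterns should give the same result on the sample.
var_dump(preg_replace($compact, '', $sample) === preg_replace($verbose, '', $sample));
```

On this sample, the line outside the <pre> element loses its indentation and the line inside keeps it.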

The possessive quantifiers ('++', '*+', and '?+') are essential to prevent catastrophic backtracking. (I can't help it: that phrase always makes me think of the resonance cascade scenario from Half-Life.)

This technique also assumes reasonably well-formed HTML, i.e., all <pre>...</pre> tags properly balanced. Tags inside of SGML comments will mess it up, too--unless they happen to be balanced. You can deal with comments, too, if you don't mind making the regex twice as long and three times as ugly. :)
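As a rough check of the behaviour described above, the following sketch runs the pattern over a small fragment with indented markup both outside and inside a <pre> element. The function name strip_indent_outside_pre() and the sample input are made up for illustration; they are not part of Kohana.

```php
<?php
// Hypothetical wrapper around the pattern from the answer.
function strip_indent_outside_pre(string $str): string
{
    $result = preg_replace(
        '~^[ \t]++(?=(?:[^<]++|<(?!/?+pre\b))*+(?:\z|<pre\b))~im',
        '',
        $str
    );

    // preg_replace() returns null on a PCRE error (for example, hitting
    // the backtrack limit); fall back to the untouched input in that case.
    return $result ?? $str;
}

// Illustrative input: an indented list, a <pre> block with indented
// code, and an indented trailing line.
$input = "  <ul>\n"
       . "\t<li>item</li>\n"
       . "  </ul>\n"
       . "<pre>\n"
       . "    indented code\n"
       . "\tmore code\n"
       . "</pre>\n"
       . "   after\n";

echo strip_indent_outside_pre($input);
```

The list lines and the trailing line come out flush left, while the two lines between <pre> and </pre> keep their leading spaces and tab. Dropping this in place of the original preg_replace() line in auto_p() is the intended use, assuming the rest of the function is unchanged.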

OTHER TIPS

Your problem has been discussed a lot, I guess. Check out this link:

http://us3.php.net/manual/en/function.nl2br.php#91828

This one as well:

http://us3.php.net/manual/en/function.nl2br.php#39641

Licensed under: CC-BY-SA with attribution
