Help with a regex that strips out leading white space
Question
I am modifying a core function of the Kohana library, the text::auto_p() function.
The function describes itself as "nl2br() on steroids". Essentially, it turns single line breaks into <br /> and wraps text separated by double line breaks in <p> tags.
The limitation I have found is that it will also put <br />s inside a <pre> element. This creates double new lines, which isn't what I want. I have made a modification that picks up pre elements with a regex and uses a callback to strip out the <br />, which works all right.
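For readers who want to see the shape of that modification, here is a minimal sketch of a regex plus callback that removes <br /> tags inside <pre> blocks. The function name strip_br_in_pre() and both patterns are assumptions for illustration, not the code actually used in the modified helper.

```php
<?php
// Hypothetical helper (not part of Kohana): remove <br /> tags that were
// inserted inside <pre>...</pre> blocks, leaving the rest of the markup alone.
function strip_br_in_pre($html)
{
    return preg_replace_callback(
        '~<pre\b[^>]*>.*?</pre>~is',   // each <pre> element, non-greedy, across lines
        function ($m) {
            // Drop every <br>, <br/> or <br /> found inside this one element.
            return preg_replace('~<br\s*/?>~i', '', $m[0]);
        },
        $html
    );
}

$html = "<p>text<br />\nmore</p>\n<pre>line 1<br />\nline 2</pre>";
$clean = strip_br_in_pre($html);
echo $clean;
```

The <br /> inside the <p> is kept; only the one inside the <pre> is removed. Like the rest of this approach, it assumes the <pre> tags are balanced.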
However, the main problem is that I have code samples in my text that get auto_p()'d, and I need to preserve their indentation (for readability). Unfortunately for me, the function strips leading and trailing white space on lines.
Here is the regex that strips leading space:
$str = preg_replace('~^[ \t]+~m', '', $str);
I'm not the best regex guru, but I'm pretty sure that says "Get leading spaces and tabs where there is at least one and replace them with an empty string."
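That reading is right. A small check, using a made-up sample string, shows what the pattern does on its own. The ~m modifier makes ^ match at the start of every line, not just the start of the string.

```php
<?php
// Sample input invented for illustration: one line indented with spaces,
// one with a tab, one not indented at all.
$str = "  two spaces\n\tone tab\nno indent\n";

// Same pattern as above: leading spaces/tabs on each line are removed.
$stripped = preg_replace('~^[ \t]+~m', '', $str);

echo $stripped;
```

Every line loses its indentation, which is exactly why code inside a <pre> element gets flattened.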
I have tried removing this line, but then it adds <br /> where I definitely do not want them. In one case, I was getting output like this:
<ul><br />
<li>something</li>
</ul>
How would I modify this regex or code to not strip leading space inside of a <pre> element?
The original helper function from Kohana is available here (scroll almost to the bottom).
I know I will get a few 'Use a HTML parser' type answers - and while you may be correct - the existing code simply uses regex, and I would prefer a simpler solution (where I don't have to include a library etc).
Thanks for your time.
Solution
Here's how I would do it:
$str = preg_replace(
'~^[ \t]++(?=(?:[^<]++|<(?!/?+pre\b))*+(?:\z|<pre\b))~im',
'', $str);
After matching some line-leading whitespace, the lookahead scans ahead for <pre> or </pre> tags. The meat of the lookahead is this bit:
(?:[^<]++|<(?!/?+pre\b))*+
It matches zero or more of anything that's not a left angle bracket, or a left angle bracket if it's not the beginning of a <pre> or </pre> tag. That part will only stop matching when it encounters a <pre> (starting) tag, a </pre> (ending) tag, or the end of the input. If it's a starting tag or the end of the input that stops it, you're outside any <pre> element, so the lookahead succeeds and the whitespace is removed. If it's an ending tag that stops it, you know you're inside a <pre> element, so you don't want to do the replacement.
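To make the behaviour concrete, here is a small check of the pattern against a made-up input (the sample string is an assumption, not taken from the question). Indentation outside the <pre> element is stripped, while the indented line inside it is left alone.

```php
<?php
// Sample input invented for illustration.
$str = "  outside\n<pre>\n    indented code\n</pre>\n  after\n";

// The pattern from this answer, unchanged.
$out = preg_replace(
    '~^[ \t]++(?=(?:[^<]++|<(?!/?+pre\b))*+(?:\z|<pre\b))~im',
    '', $str);

echo $out;
```

For the "  outside" line the scan stops at <pre>, and for "  after" it stops at the end of the input, so both lookaheads succeed. For the indented code line the scan stops at </pre>, the lookahead fails, and the four spaces survive.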
The possessive quantifiers ('++', '*+', and '?+') are essential to prevent catastrophic backtracking. (I can't help it: that phrase always makes me think of the resonance cascade scenario from Half-Life.)
This technique also assumes reasonably well-formed HTML, i.e., all <pre>...</pre> tags properly balanced. Tags inside SGML comments will mess it up too, unless they happen to be balanced. You can deal with comments as well, if you don't mind making the regex twice as long and three times as ugly. :)
OTHER TIPS
Your problem is discussed a lot, I guess. Check out this link:
http://us3.php.net/manual/en/function.nl2br.php#91828
This one as well: