Regular expression to replace line feeds with a space only if the break is not in the contents of an HTML attribute

StackOverflow https://stackoverflow.com/questions/3992783

Question

I'm trying to write a regular expression that replaces line feeds between certain areas of a text file, but only in plain text content (i.e. excluding text inside HTML attribute values, such as href). I'm not having much luck past the first part.

Example input:

AUTHOR: Me
DATE: Now
CONTENT:
This is an example. This is another example. <a href="http://www.stackoverflow/example-
link-that-breaks">This is an example.</a> This is an example. This is yet another
example.
END CONTENT
COMMENTS: 0

Example output:

AUTHOR: Me
DATE: Now
CONTENT:
This is an example. This is another example. <a href="http://www.stackoverflow/example-link-that-breaks">This is an example.</a> This is an example. This is yet another example.
END CONTENT
COMMENTS: 0

So ideally, a line break is replaced with a space if it occurs in plain text, but is removed without adding a space if it is inside an HTML attribute value (mostly href, and I'm fine if I have to limit it to that).

Solution

This will remove newlines in attribute values, assuming the values are enclosed in double-quotes:

$s = preg_replace(
       '/[\r\n]+(?=[^<>"]*+"(?:[^<>"]*+"[^"<>]*+")*+[^<>"]*+>)/',
       '', $s);

The lookahead asserts that, between the current position (where the newline was found) and the next >, there's an odd number of double-quotes. This doesn't allow for single-quoted values, or for angle brackets inside the values; both can be accommodated if need be, but this is ugly enough already. ;)
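The parity rule that the lookahead encodes can be spelled out in ordinary code. The following is only an illustrative sketch in Python, not part of the original answer; the function name inside_quoted_value and its pos parameter (the index just after the newline run) are hypothetical.

```python
def inside_quoted_value(text, pos):
    """Mimic the lookahead: scan from pos to the next '>' and report
    whether an odd number of double quotes was seen on the way."""
    quotes = 0
    for ch in text[pos:]:
        if ch == '"':
            quotes += 1
        elif ch == '<':
            # A new tag opens before any '>' closes one, so the newline
            # was in plain text; the regex fails here as well.
            return False
        elif ch == '>':
            # An odd count means one quote is still unmatched, i.e. the
            # newline sat between an opening quote and its closing quote.
            return quotes % 2 == 1
    return False  # no '>' follows at all


sample = 'see <a href="http://e/example-\nlink">here</a> and\nthere'
print(inside_quoted_value(sample, sample.index('\n') + 1))   # first newline
print(inside_quoted_value(sample, sample.rindex('\n') + 1))  # second newline
```

Like the regex, this sketch ignores single-quoted values and angle brackets inside values.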

After that, you can replace any remaining newlines with spaces:

$s = preg_replace('/[\r\n]+/', ' ', $s);

See it in action on ideone.com.
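For readers who want to try the two steps outside PHP, here is a minimal Python port, offered as a sketch rather than as the answerer's code. Note that applying the second replacement to the whole file would also join the AUTHOR, DATE and COMMENTS lines, so the sketch limits both steps to the text between the CONTENT: and END CONTENT lines. The names unwrap and unwrap_content, and the assumption that lines end in "\n", are mine.

```python
import re

# Port of the answer's first pattern. Plain `*` stands in for the possessive
# `*+`, which Python's re module only accepts from version 3.11 onward.
NEWLINE_IN_ATTRIBUTE = re.compile(
    r'[\r\n]+(?=[^<>"]*"(?:[^<>"]*"[^"<>]*")*[^<>"]*>)'
)
ANY_NEWLINE = re.compile(r'[\r\n]+')

# Assumption: the block is delimited by lines reading exactly "CONTENT:" and
# "END CONTENT", and lines end in "\n".
CONTENT_BLOCK = re.compile(r'^(CONTENT:\n)(.*?)(\nEND CONTENT)$', re.S | re.M)


def unwrap(text):
    """Step 1: drop newlines inside double-quoted attribute values.
    Step 2: turn every remaining newline run into a single space."""
    text = NEWLINE_IN_ATTRIBUTE.sub('', text)
    return ANY_NEWLINE.sub(' ', text)


def unwrap_content(document):
    """Apply unwrap() only to the body of the CONTENT block, leaving the
    delimiter lines and everything outside the block untouched."""
    return CONTENT_BLOCK.sub(
        lambda m: m.group(1) + unwrap(m.group(2)) + m.group(3), document
    )
```

Calling unwrap_content on the example input from the question should give the example output, with the header lines left on their own lines.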

OTHER TIPS

Ideally you would use a real HTML parser (or an XML parser if it were XHTML) and replace the attribute contents with that.

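However, the following may do the trick if the regex engine supports positive lookbehind of arbitrary length (.NET does, for example; PCRE, which PHP's preg_* functions use, rejects unbounded lookbehind):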

(?<=\<[^<>]+=\s*("[^"]*|'[^']*))[\r\n]+

Usage: Replace all occurrences of this regex with an empty string.
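Where the engine lacks arbitrary-length lookbehind, a similar effect can be had with nested substitutions: find each tag, then clean only its quoted attribute values. This is a different technique from the lookbehind above, sketched here in Python under the same limitation (no angle brackets inside values); the function name strip_newlines_in_attributes is hypothetical. Unlike the lookbehind, it requires each value to have its closing quote.

```python
import re

TAG = re.compile(r'<[^<>]+>')
# "=" plus optional whitespace, then a double- or single-quoted value.
QUOTED_VALUE = re.compile(r'''(=\s*)("[^"]*"|'[^']*')''')
NEWLINES = re.compile(r'[\r\n]+')


def strip_newlines_in_attributes(text):
    """Remove newline runs that fall inside quoted attribute values,
    leaving newlines in plain text and elsewhere in the tag alone."""
    def fix_tag(tag_match):
        # Within one tag, rewrite each quoted value with newlines removed.
        return QUOTED_VALUE.sub(
            lambda v: v.group(1) + NEWLINES.sub('', v.group(2)),
            tag_match.group(0),
        )
    return TAG.sub(fix_tag, text)


html = '<a href="http://x/example-\nlink" title=\'a\nb\'>text\nmore</a>'
print(strip_newlines_in_attributes(html))
```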

Licensed under: CC-BY-SA with attribution