Question

I'm trying to strip HTML tags from a piece of text. The trouble is that whatever I use (regex, strip_tags, etc.) runs into the same problem: it also strips text that is not HTML but merely looks like it.

"Some <foo@bar.com> Content" --> "Some Content"
"Some <Content which looks like this" --> "Some "

Is there a way I can get around this?

Solution

A fully correct solution would be a full-fledged HTML parser. See this legendary question for a full discussion.

A simple 80% solution would be to look for all known tags and strip them.

RegExp('</?(a|b|blockquote|cite|dd|dl|dt|...|u)\b.*?>')

The code is more readable if you keep the tags in an array and build the expression from it. This approach does not handle HTML comments well, so if you need more than hack quality, don't use a hack. If you need correctness, use an actual HTML parser (e.g. DOMDocument in PHP).
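
As a rough sketch of the array idea, here is one way it could look in JavaScript, to match the RegExp(...) snippet above. The same pattern should carry over to preg_replace in PHP. The tag list is only a sample, and stripKnownTags is a made-up helper name, not part of any library. I've also swapped .*? for [^>]* so a tag can span a line break, which is my own assumption about what you'd want.

```javascript
// Sample list of "known" tags (an assumption); extend it with the tags you expect.
const KNOWN_TAGS = ['a', 'b', 'blockquote', 'cite', 'dd', 'dl', 'dt', 'u'];

// Build one alternation from the array instead of hand-writing a long pattern.
// The backslash is doubled because '\b' inside a string literal is a backspace.
const TAG_PATTERN = new RegExp('</?(?:' + KNOWN_TAGS.join('|') + ')\\b[^>]*>', 'gi');

// Hypothetical helper: removes only tags whose name is in KNOWN_TAGS.
function stripKnownTags(text) {
  return text.replace(TAG_PATTERN, '');
}

console.log(stripKnownTags('<b>bold</b> and <a href="x">link</a>')); // bold and link
console.log(stripKnownTags('Some <foo@bar.com> Content'));           // Some <foo@bar.com> Content
```

It is still a hack: an address such as <b@example.com> begins with a known tag name followed by a word boundary, so it would be stripped as well, and comments are not handled at all.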

OTHER TIPS

Have you tried the HTML Purifier library? You can configure it to strip out all tags, and I've found it very reliable.

Licensed under: CC-BY-SA with attribution