Strip only valid HTML

https://stackoverflow.com/questions/17743811

03-06-2022

Question

I'm trying to strip HTML tags from a piece of text. The trouble is that whatever I use (regex, strip_tags, etc.) runs into the same problem: it also strips text that is not HTML but merely looks like it.

"Some <foo@bar.com> Content" --> "Some Content"
"Some <Content which looks like this" --> "Some"

Is there a way I can get around this?

Solution

A fully correct solution would be a full-fledged HTML parser. See this legendary question for a full discussion.
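As a rough illustration of the parser-based route, here is a minimal sketch in Python using the standard library's html.parser. It is not a full HTML5 parser and not PHP's DOMDocument, so treat it as an approximation of the idea. The tag list KNOWN_TAGS and the names KnownTagStripper and strip_with_parser are hypothetical, chosen for this example only. The parser does the tokenizing, and anything it reports as a tag whose name is not in the list is written back out unchanged.

```python
from html.parser import HTMLParser

# Hypothetical list of tag names to treat as real HTML; extend as needed.
KNOWN_TAGS = {"a", "b", "blockquote", "cite", "dd", "dl", "dt", "u"}


class KnownTagStripper(HTMLParser):
    """Drops known tags and comments, keeps everything else as text."""

    def __init__(self):
        # convert_charrefs=True means entities such as &amp; arrive decoded.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Unknown "tags" (e.g. <foo@bar.com>) are re-emitted verbatim.
        if tag not in KNOWN_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <x/>; same rule as a start tag.
        if tag not in KNOWN_TAGS:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in KNOWN_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    # Comments are dropped because handle_comment is not overridden.


def strip_with_parser(text):
    parser = KnownTagStripper()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

For example, strip_with_parser("<b>bold</b> <foo@bar.com> text") keeps the address and removes only the b tags.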

A simple 80% solution would be to look for all known tags and strip them.

RegExp('</?(a|b|blockquote|cite|dd|dl|dt|...|u)\b.*?>')

The code is more readable if you keep the tag names in an array and build the expression from it. This approach does not handle comments well, so treat it as a quick hack only. If you need correctness, use an actual HTML parser (e.g. DOMDocument in PHP).
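The "array of tags" idea can be sketched as follows, here in Python for illustration. The tag list is a hypothetical subset (the full list in the expression above is elided), and strip_known_tags is a name made up for this example. The pattern uses [^>]* rather than .*? so that a match can never run past the closing bracket of the tag.

```python
import re

# Hypothetical subset of tag names to strip; extend as needed.
KNOWN_TAGS = ["a", "b", "blockquote", "cite", "dd", "dl", "dt", "u"]

# Build the alternation from the list. Longer names go first, and \b stops
# "b" from matching the start of an unlisted name such as "br".
_ALTERNATION = "|".join(
    re.escape(tag) for tag in sorted(KNOWN_TAGS, key=len, reverse=True)
)
TAG_RE = re.compile(r"</?(?:" + _ALTERNATION + r")\b[^>]*>", re.IGNORECASE)


def strip_known_tags(text):
    """Remove opening and closing tags whose name is in KNOWN_TAGS."""
    return TAG_RE.sub("", text)
```

With this list, "Some <foo@bar.com> Content" is left untouched because foo is not a known tag name. The hack still has gaps: an address such as <a@example.com> would be stripped, since it starts with the tag name a followed by a non-word character.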

OTHER TIPS

Have you tried the HTML Purifier library? You can configure it to strip out all tags, and I've found the library very reliable.

Licensed under: CC-BY-SA with attribution
