Can you help with a regular expression or function to remove HTML encoded tags?

https://stackoverflow.com/questions/628493

06-07-2019

Question

I need a regex or function that can remove the ENCODED HTML tags from a database record. I have text in a database that is being stored (from TinyMCE) as encoded HTML.

The code has the less-than and greater-than signs of the tags encoded as &lt; and &gt;.

I would like to remove all the encoded tags and HTML and just leave the plain text and spaces only.

Solution

I'd avoid a regex here, since writing one that covers any and all HTML a user might foist on you is a task that could keep a full-time employee permanently busy.

Instead, a two-step approach that relies on functionality already built into PHP is a better choice.

First, let's turn the encoded HTML entities back into greater than and less than signs with htmlspecialchars_decode.

$string = htmlspecialchars_decode($string);

This should give us a string of proper HTML. (If your quotes are still encoded, see the second argument of htmlspecialchars_decode in the PHP documentation.)
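As a small illustration of that second argument, the sketch below decodes the same made-up string with two different quote flags. The sample text is an assumption for demonstration only; ENT_QUOTES and ENT_NOQUOTES are standard PHP constants.

```php
<?php
// Hypothetical sample containing encoded ampersand, double and single quotes.
$encoded = 'Tom &amp; Jerry said &quot;hi&quot; and it&#039;s fine';

// ENT_QUOTES decodes both &quot; and &#039;.
echo htmlspecialchars_decode($encoded, ENT_QUOTES), "\n";
// Tom & Jerry said "hi" and it's fine

// ENT_NOQUOTES leaves both kinds of quote entities untouched.
echo htmlspecialchars_decode($encoded, ENT_NOQUOTES), "\n";
// Tom & Jerry said &quot;hi&quot; and it&#039;s fine
```

Passing the flag explicitly avoids depending on the default, which differs between PHP versions.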

To finish, we'll strip out the HTML tags with the PHP function strip_tags. This will remove any and all HTML tags from the source.

$string = strip_tags($string);

Wrapped in a function/method

function decodeAndStripHTML($string){
    return strip_tags(htmlspecialchars_decode($string));
}
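A quick usage sketch follows. The stored value is a made-up example of what an encoded TinyMCE record might look like, not data from the question.

```php
<?php
// Same helper as above, repeated so the snippet runs on its own.
function decodeAndStripHTML($string){
    return strip_tags(htmlspecialchars_decode($string));
}

// Hypothetical database value with the angle brackets stored as entities.
$stored = '&lt;p&gt;Hello &lt;strong&gt;world&lt;/strong&gt;!&lt;/p&gt;';

echo decodeAndStripHTML($stored), "\n"; // Hello world!
```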

OTHER TIPS

Sounds like you'll need to translate &lt; to < and &gt; to > and then use an HTML parser to extract the text (the latter can't/shouldn't be done with regular expressions).
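One possible way to follow this advice in PHP is the bundled DOM extension. The sketch below is an assumption about how it could be done, not the answerer's code: the function name extractTextWithParser is hypothetical, and the XML-encoding prefix is a commonly used workaround to make DOMDocument treat the input as UTF-8.

```php
<?php
function extractTextWithParser(string $encoded): string
{
    // Step 1: turn &lt; and &gt; back into real angle brackets.
    $html = htmlspecialchars_decode($encoded);

    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The wrapper div guarantees a non-empty fragment for the parser.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 2: read the text content of the parsed body.
    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

echo extractTextWithParser('&lt;p&gt;Hello &lt;em&gt;world&lt;/em&gt;&lt;/p&gt;'), "\n";
// Hello world
```

This requires the DOM extension, which is enabled in most PHP builds.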

You might also be interested by this library called HTML Purifier.

They say, and I quote:

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications. Tired of using BBCode due to the current landscape of deficient or insecure HTML filters? Have a WYSIWYG editor but never been able to use it? Looking for high-quality, standards-compliant, open-source components for that application you're building? HTML Purifier is for you!

Remove HTML regex

In response to Alan Storm: I unfortunately was that full-time employee (haha) for a web application that used JavaScript validation.

Here is the JavaScript regex that I wrote. I am sure it could be cleaned up:

// The + after each entity class lets multi-character entities such as &#38; or &amp; match.
var regex = /(&#[0-9]+;)|(&[A-Za-z0-9]+;)|(<[/]?[A-Za-z0-9 =/.:;,!@#$%^&*"'?|_{}\~`()+-]+[/]?>)/g;

Where a numeric entity (such as &#38;), a named entity (such as &amp;, &lt; or &gt;), or ANYTHING inside angle brackets was a match, highlighted, and eventually removed for the user.

-Side Note: I don't believe in thinking for the user, but this regex was required.

<.*?>

I usually use \s*?<.*?>\s*? to match all HTML tags. To remove tags encoded as entities, you could use \s*?&lt;.*?&gt;\s*?

The \s matches whitespace, . (dot) matches any character, * means zero or more occurrences of the preceding token, and ? after * makes the * lazy (ungreedy).

Depending on the language you're using, you might have to add extra backslashes for the expression to work, like this: \\s*?<.*?>\\s*?. This applies where the string literal treats the backslash as an escape character (Java, for instance). PHP leaves \s intact in both single- and double-quoted strings, so doubling is optional there.

However, if the text contains greater-than and less-than characters that are not HTML tags (a math equation, for instance), you will run into problems. In that case, you need a more sophisticated and less straightforward regex.
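To make both the lazy pattern and the problem case concrete, here is a minimal PHP sketch using preg_replace with the shorter <.*?> form. The function names and sample strings are hypothetical and chosen only for illustration.

```php
<?php
// Lazy tag pattern; '#' delimiters avoid having to escape '/'.
function stripTagsWithRegex(string $text): string
{
    return preg_replace('#<.*?>#', '', $text);
}

// Variant for tags whose angle brackets are still entity-encoded.
function stripEncodedTagsWithRegex(string $text): string
{
    return preg_replace('#&lt;.*?&gt;#', '', $text);
}

echo stripTagsWithRegex('<p>Hello <b>world</b></p>'), "\n";        // Hello world
echo stripEncodedTagsWithRegex('&lt;p&gt;Hello&lt;/p&gt;'), "\n";  // Hello

// Problem case: the comparison operators are mistaken for one tag,
// so "< b and c >" disappears from the output.
echo stripTagsWithRegex('if a < b and c > d'), "\n";               // if a  d
```

The last call shows why plain text containing < and > needs either a stricter pattern or a different approach such as strip_tags or a parser.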

Licensed under: CC-BY-SA with attribution
