How can I clean HTML tags out of a ColdFusion string?

https://stackoverflow.com/questions/970817

13-09-2019

Question

I am looking for a quick way to strip HTML tags out of a ColdFusion string. We are pulling in an RSS feed that could potentially have anything in it. We then manipulate the information and send it on to another place. Currently we are doing this with a regular expression. Is there a better way to do this?

<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
  <cfset myFeed.item[i].description.value = 
   REReplaceNoCase(myFeed.item[i].description.value, '<(.|\n)*?>', '', 'ALL')>
</cfloop>

We are using ColdFusion 8.

Solution

Disclaimer: I am a fierce advocate of using a proper parser (instead of regex) to parse HTML. However, this question isn't about parsing HTML, but about destroying it. For all tasks that go beyond that, use a parser.

I think your regex is good. As long as all you need is to remove every HTML tag from the input, using a regex like yours is safe.

Anything else would probably be more hassle than it's worth, but you could write a small function that loops through the string once, character by character, and removes everything inside tag brackets. For example:

- switch on an "inTag" flag as soon as you encounter a "<" character
- switch it off as soon as you encounter ">"
- copy characters to the output string as long as the flag is off
- for performance, use a Java StringBuilder object instead of string concatenation
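
The steps above could be sketched as a small CFScript function. This is only an illustration: the name stripTagsByScan is made up for this example, and the decision to keep a stray ">" that appears outside any tag is an assumption, since the steps do not say what to do with it.

```cfml
<cfscript>
// Hypothetical helper, not a built-in function.
// CF8 requires all var declarations at the top of the function.
function stripTagsByScan(input) {
    var out = CreateObject("java", "java.lang.StringBuilder").init();
    var inTag = false;
    var i = 0;
    var ch = "";

    for (i = 1; i lte Len(input); i = i + 1) {
        ch = Mid(input, i, 1);
        if (ch eq "<") {
            // Entering a tag: stop copying.
            inTag = true;
        } else if (ch eq ">" and inTag) {
            // Leaving a tag: resume copying with the next character.
            inTag = false;
        } else if (not inTag) {
            // JavaCast avoids ambiguity between the overloaded append() methods.
            out.append(JavaCast("string", ch));
        }
    }
    return out.toString();
}
</cfscript>

<cfset cleaned = stripTagsByScan("a <b>bold</b> word")>
```

With the sample input in the last line, cleaned ends up as "a bold word".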

For a high-demand part of your app, this may be faster than the regex. But the regex is clean and probably fast enough.

Maybe this modified regex has some advantages for you:

<[^>]*(?:>|$)

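- it also catches an unclosed tag at the end of the string
- [^>]* is better than (.|\n)*? because the negated character class already matches newlines and needs no alternation or capturing group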

The use of REReplaceNoCase() is unnecessary when there are no actual letters in the pattern. Case-insensitive regex matching is slower than doing it case-sensitively.
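
Putting both suggestions together (the modified pattern and the case-sensitive REReplace()), the loop from the question might look like the following. It is a sketch that reuses the question's myFeed structure and assumes nothing else about the feed.

```cfml
<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
  <!--- Remove complete tags, plus an unclosed tag at the very end of the string. --->
  <cfset myFeed.item[i].description.value =
    REReplace(myFeed.item[i].description.value, "<[^>]*(?:>|$)", "", "ALL")>
</cfloop>
```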

OTHER TIPS

HTML is not a regular language, so using regular expressions on (uncontrolled) HTML is something that should be done with great care (if at all).

Consider, for example, the following valid segment of HTML:

<img src="boat.jpg" alt="a boat" title="My boat is > everything! I <3 my boat!">

You'll note how the syntax highlighter is choking on that - as will the existing regex that has been offered.

Unless you can be certain that the string you are processing will never contain HTML like the above, you should avoid the assumptions and compromises that a pure regex approach forces on you.

(Note: The same problem applies to the suggested char-by-char method too.)
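
To make the failure concrete, here is a small sketch that runs a simple tag-stripping pattern over the img example. The variable names are made up for the illustration.

```cfml
<!--- Single quotes delimit the CFML string so the attribute quotes need no escaping. --->
<cfset sample = '<img src="boat.jpg" alt="a boat" title="My boat is > everything! I <3 my boat!">'>
<cfset stripped = REReplace(sample, "<[^>]*>", "", "ALL")>
<!--- The first match ends at the ">" inside the title attribute, so part of the
      attribute text is left behind: stripped is " everything! I " --->
<cfoutput>[#HTMLEditFormat(stripped)#]</cfoutput>
```

The tag should have vanished completely, yet a fragment of the title attribute leaks into the output.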

To solve your problem, you should use a DOM parser to parse your string into an HTML object, then loop through each element and convert it to text.

If you have valid XHTML then you can use CF's XmlParse() to produce the object, which you can then loop through. If it might be non-XML HTML then there's no built-in option with CF8, so you'll have to investigate options in Java/etc.
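
For the XHTML case, a minimal sketch could look like the one below. It rests on several assumptions: the helper name xhtmlToText and the wrapping root element are invented for the example, the input must be well-formed XML (XmlParse() throws otherwise, for instance on an unclosed img tag or on an entity such as &nbsp;), and it relies on XmlSearch() returning a plain string for a string-valued XPath expression, which as far as I know is supported from ColdFusion 8 on. Instead of looping over the elements by hand, it lets XPath collect the text.

```cfml
<cfscript>
// Hypothetical helper, not a built-in function.
function xhtmlToText(fragment) {
    // Wrap the fragment in a dummy root so several top-level nodes still parse.
    var doc = XmlParse("<root>" & fragment & "</root>");
    // Per XPath rules, string() of an element is the concatenation
    // of all its descendant text nodes in document order.
    return XmlSearch(doc, "string(/root)");
}
</cfscript>

<cftry>
  <cfset plain = xhtmlToText("<p>Hello <b>world</b>!</p>")>
  <cfcatch type="any">
    <!--- Not well-formed XML: fall back to another strategy here. --->
    <cfset plain = "">
  </cfcatch>
</cftry>
```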

I use this:

REReplaceNoCase(text, "<[^[:space:]][^>]*>", "", "ALL");

It works fine in 99% of cases.
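
A short sketch of what that pattern does and does not match; the sample string is made up for the illustration.

```cfml
<cfset sample = "if a < b and b > c, use <b>bold</b>">
<cfset stripped = REReplaceNoCase(sample, "<[^[:space:]][^>]*>", "", "ALL")>
<!--- "< b and b >" survives because whitespace follows the "<";
      only <b> and </b> are removed. --->
<cfoutput>#HTMLEditFormat(stripped)#</cfoutput>
```

The remaining 1% includes text such as "a <b and b> c", where no space follows the "<", so the comparison is mistaken for a tag and removed.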

The best way is usually to coerce < to &lt; and > to &gt;. This way you aren't making assumptions about the nature of the message. Somebody may be talking about <tags> or trying to be <<expressive>> or describing a keystroke <Ctrl>+C or using maths 1 < x > 3. Even smilies could trigger the regex <8P X>.

<cfloop from="1" to="#ArrayLen(myFeed.item)#" index="i">
    <cfset myFeed.item[i].description.value = ReplaceList(myFeed.item[i].description.value, '<,>', '&lt;,&gt;')>
</cfloop>

cflib is your friend: stripHTML

<cfset a = "<b><font color = 'red'>(PCB) <1 ppm </font></b>">

<cfset b = REReplaceNoCase(a, "<[^><]*>", '', 'ALL')>

<cfdump var="#b#">

Output: b = "(PCB) <1 ppm " (the space before </font> is kept).

The regex "<[^><]*>" removes every tag together with the characters between its brackets, but it leaves a lone < or > untouched, so these can still be used as less-than or greater-than symbols in the string.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow