Filtering JavaScript out of HTML

https://stackoverflow.com/questions/858773

21-08-2019
|

Question

I have a rich text editor that passes HTML to the server. That HTML is then displayed to other users. I want to make sure there is no JavaScript in that HTML. Is there any way to do this?

Also, I'm using ASP.NET if that helps.

Solution

The simplest thing to do would be to either strip out tags with a regex. Trouble is that you could do plenty of nasty things without script tags (e.g. imbed dodgy images, have links to other sites that have nasty Javascript) . Disabling HTML completely by convert the less than/greater than characters into their HTML entities forms (e.g. <) could also be an option.

If you want a more powerful solution, in the past I have used AntiSamy to sanitize incoming text so that it's safe for viewing.

OTHER TIPS

The only way to ensure that some HTML markup does not contain any JavaScript is to filter it of all unsafe HTML tags and attributes, in order to prevent Cross-Site Scripting (XSS).

However, there is in general no reliable way of explicitly removing all unsafe elements and attributes by their names, since certain browsers may interpret ones of which you weren't even aware at the time of design, and thus open up a security hole for malicious users. This is why you're much better off taking a whitelisting approach rather than a blacklisting one. That is to say, only allow HTML tags that you are sure are safe, and stripping all others by default. Indeed, only one accidentally permitted tag can make your website vulnerable to XSS.

Whitelisting (good approach)

See this article on HTML sanitisation, which offers some specific examples of why you should whitelist rather than blacklist. Quote from that page:

Here is an incomplete list of potentially dangerous HTML tags and attributes:

script, which can contain malicious script

applet, embed, and object, which can automatically download and execute malicious code

meta, which can contain malicious redirects

onload, onunload, and all other on* attributes, which can contain malicious script

style, link, and the style attribute, which can contain malicious script

Here is another helpful page that suggests a set of HTML tags & attributes as well as CSS attributes that are typically safe to allow, as well as recommended practices.

Blacklisting (generally bad approach)

Although many website have in the past (and currently) use the blacklisting approach, there is almost never any true need for it. (The security risks invariably outweight the potential limitations whitelisting enforces with the formatting capabilities that are granted to the user.) You need to be very aware of its flaws.

For example, this page gives a list of what are supposedly "all" the HTML tags you might want to strip out. Just from observing it briefly, you should notice that it contains a very limited number of element names; a browser could easily include a proprietary tag that unwittingly allowed scripts to run on your page, which is essentially the main problem with blacklisting.

Finally, I would strongly recommend that you utilise an HTML DOM library (such as the well-known HTML Agility Pack) for .NET, as opposed to RegEx to perform the cleaning/whitelisting, since it would be significantly more reliable. (It is quite possible to create some pretty crazy obfuscated HTML that can fool regexes! A proper HTML reader/writer makes the coding of the system much easier, anyway.)

Hopefully that should given you a decent overview of what you need to design in order to fully (or at least maximally) prevent XSS, and how it's critical that HTML sanitisation is performed with the unknown factor in mind.

As pointed out by Lee Theobald, that's a very dangerous plan. You cannot by definition ever produce "safe" HTML by filtering/blacklisting, since the user might put stuff into the HTML that you didn't think about (or that don't even exist in your browser version, but does in others).

The only safe way is a whitelisting approach, i.e. strip everything but plain text and certain specific HTML constructs. This incidentially is what stackoverflow.com does :-).

Here is how I do it using a white-listing approach (Javascript and Python code)

https://github.com/dcollien/FilterHTML

I define a specification for a subset of allowed HTML, and that is only what should get through this filter. There's some options to also purify URL attributes, by only allowing certain schemes (like http:, ftp:, etc.) and disallowing those that would cause XSS/Javascript problems (like javascript:, or even data:)

edit: This isn't going to give you 100% safety out of the box for all situations, but used intelligently and in conjunction with a few other tricks (like checking if urls are on the same domain, and the correct content-type, etc.) it could be what you need

If you want the html to be changed so users can see the HTML code itself. Do a string replace of all '<', '>', '&' and ';'. For example '<' becomes '<'.

If you want the html to work, the easiest way is to remove all HTML and Javascript and then replace the HTML only. Unfortunately there is almost not sure way of removing all javascript and allowing only HTML.

For example you may want to allow images. However you may not know that you can do

<img src='evilscript.js'>

and it can run that script. It becomes very unsafe very fast$. This is why most websites like Wikipedia and this website use special markdown language. This makes it much easier to allow formatting but not malicious javascript.

You may want to check how some browser based WYSIWYG editors such as TinyMCE do. They usually remove JS and seem to do a resonable job at it.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow