Question

Jeff actually posted about this in Sanitize HTML. But his example is in C# and I'm actually more interested in a Java version. Does anyone have a better version for Java? Is his example good enough to just convert directly from C# to Java?

[Update] I have put a bounty on this question because SO wasn't as popular when I asked the question as it is today (*). As for anything related to security, the more people look into it, the better it is!

(*) In fact, I think it was still in closed beta


Solution

Don't do this with regular expressions. Remember, you're not protecting just against valid HTML; you're protecting against the DOM that web browsers create. Browsers can be tricked into producing valid DOM from invalid HTML quite easily.

For example, see this list of obfuscated XSS attacks. Are you prepared to tailor a regex to prevent this real-world attack on Yahoo and Hotmail in IE6/7/8?

<HTML><BODY>
<?xml:namespace prefix="t" ns="urn:schemas-microsoft-com:time">
<?import namespace="t" implementation="#default#time2">
<t:set attributeName="innerHTML" to="XSS&lt;SCRIPT DEFER&gt;alert(&quot;XSS&quot;)&lt;/SCRIPT&gt;">
</BODY></HTML>

How about this attack that works on IE6?

<TABLE BACKGROUND="javascript:alert('XSS')">

How about attacks that are not listed on this site? The problem with Jeff's approach is that it's not a whitelist, as claimed. As someone on that page adeptly notes:

The problem with it, is that the html must be clean. There are cases where you can pass in hacked html, and it won't match it, in which case it'll return the hacked html string as it won't match anything to replace. This isn't strictly whitelisting.

I would suggest a purpose-built tool like AntiSamy. It works by actually parsing the HTML, then traversing the DOM and removing anything that's not in the configurable whitelist. The major difference is its ability to gracefully handle malformed HTML.

The best part is that it actually unit tests for all the XSS attacks on the above site. Besides, what could be easier than this API call:

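import org.owasp.validator.html.AntiSamy;
import org.owasp.validator.html.CleanResults;
import org.owasp.validator.html.Policy;
import org.owasp.validator.html.PolicyException;
import org.owasp.validator.html.ScanException;

// POLICY_FILE is the path to the AntiSamy policy XML file that defines the whitelist.
public String toSafeHtml(String html) throws ScanException, PolicyException {
    // Load the whitelist, then parse the untrusted HTML and keep only what the policy allows.
    Policy policy = Policy.getInstance(POLICY_FILE);
    AntiSamy antiSamy = new AntiSamy();
    CleanResults cleanResults = antiSamy.scan(html, policy);
    return cleanResults.getCleanHTML().trim();
}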
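To make the parse-then-filter idea concrete, here is a rough sketch that needs no third-party library. It is not how AntiSamy is implemented. It assumes the lenient HTML parser that ships with the JDK (javax.swing.text.html.parser), which targets HTML 3.2 and does not model how modern browsers build a DOM, and it filters parser callbacks instead of walking a DOM tree. The class name and the POLICY map are hypothetical. Treat it as an illustration of allow-listing, not as a production sanitizer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class AllowListSanitizer {
    // Hypothetical policy: tag name -> attribute names allowed on that tag.
    static final Map<String, Set<String>> POLICY = Map.of(
            "b", Set.<String>of(), "i", Set.<String>of(), "p", Set.<String>of(),
            "br", Set.<String>of(), "a", Set.of("href"));

    static String sanitize(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override public void handleText(char[] data, int pos) {
                out.append(escape(new String(data)));   // text is always re-escaped
            }
            @Override public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                open(t, a);
            }
            @Override public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                open(t, a);                             // e.g. <br>, or tags the parser does not know
            }
            @Override public void handleEndTag(HTML.Tag t, int pos) {
                if (POLICY.containsKey(t.toString())) out.append("</").append(t).append('>');
            }
            private void open(HTML.Tag t, MutableAttributeSet a) {
                Set<String> allowed = POLICY.get(t.toString());
                if (allowed == null) return;            // tag not on the list: drop it
                out.append('<').append(t);
                for (Enumeration<?> e = a.getAttributeNames(); e.hasMoreElements();) {
                    Object key = e.nextElement();
                    String name = key.toString().toLowerCase();
                    String value = String.valueOf(a.getAttribute(key));
                    if (allowed.contains(name) && isSafeUrl(value)) {
                        out.append(' ').append(name).append("=\"").append(escape(value)).append('"');
                    }
                }
                out.append('>');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    // The only allowed attribute is href, so every kept value must be an http(s) URL.
    static boolean isSafeUrl(String value) {
        String v = value.trim().toLowerCase();
        return v.startsWith("http://") || v.startsWith("https://");
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<b onclick=\"x()\">hello</b><script>alert(1)</script>"));
    }
}
```

The design point is that nothing from the input is copied through verbatim: output is rebuilt only from tags and attributes that appear in the policy, so markup the parser does not understand is dropped instead of being passed along unmatched.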

OTHER TIPS

I'm not too convinced that using a regular expression is the best way of finding all suspect code. Regular expressions are quite easy to trick, especially when dealing with broken HTML. For example, the regular expression listed in the Sanitize HTML link will fail to remove any 'a' element that has an attribute between the element name and the 'href' attribute:

< a alt="xss injection" href="http://www.malicous.com/bad.php" >
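This failure mode is easy to reproduce. The sketch below uses a deliberately naive pattern (an assumption for illustration, not the regex from the Sanitize HTML post) that strips anchors whose first attribute is href, and then feeds it the tag shown above.

```java
import java.util.regex.Pattern;

class NaiveAnchorFilter {
    // Hypothetical blacklist pattern: only matches <a href="..."> with href as the first attribute.
    static final Pattern ANCHOR =
            Pattern.compile("<a\\s+href=\"[^\"]*\"\\s*>", Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        return ANCHOR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String plain = "<a href=\"http://www.malicous.com/bad.php\">";
        String obfuscated = "< a alt=\"xss injection\" href=\"http://www.malicous.com/bad.php\" >";
        System.out.println("[" + strip(plain) + "]");       // the plain tag is removed
        System.out.println("[" + strip(obfuscated) + "]");  // the obfuscated tag passes through unchanged
    }
}
```

Because a non-matching input is returned as-is, every variation the pattern did not anticipate is silently let through.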

A more robust way of removing malicious code is to rely on an XML parser that can handle all kinds of HTML documents (Tidy, TagSoup, etc.) and to select the elements to remove with an XPath expression. Once the HTML document is parsed into a DOM document, the elements to remove can be found easily and safely. This is even easy to do with XSLT.
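A minimal sketch of the XPath step, using only the JDK's built-in XML APIs. It assumes the input has already been converted to well-formed XHTML by a tolerant parser such as Tidy or TagSoup; that conversion is not shown. The class name, the method name, and the XPath expression in main are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathStripper {
    /** Removes every element selected by xpathExpr from a well-formed XHTML fragment. */
    static String strip(String xhtml, String xpathExpr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so untrusted input cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Select the unwanted elements, then detach each one from its parent.
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpr, doc, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                Node node = hits.item(i);
                if (node.getParentNode() != null) {
                    node.getParentNode().removeChild(node);
                }
            }

            // Serialize the remaining tree back to text.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML or the XPath is invalid", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div><p>hi</p><script>alert(1)</script>"
                + "<a alt=\"xss injection\" href=\"http://www.malicous.com/bad.php\">click</a></div>";
        System.out.println(strip(xhtml, "//script | //a"));
    }
}
```

Since the selection runs on the parsed tree, attribute order and whitespace inside a tag no longer matter. Note that selecting elements to remove is still a blacklist; inverting the expression to remove everything except known-safe elements would be closer to a whitelist.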

I extracted this regex from NoScript, the best anti-XSS add-on. It works flawlessly:

<[^\w<>]*(?:[^<>"'\s]*:)?[^\w<>]*(?:\W*s\W*c\W*r\W*i\W*p\W*t|\W*f\W*o\W*r\W*m|\W*s\W*t\W*y\W*l\W*e|\W*s\W*v\W*g|\W*m\W*a\W*r\W*q\W*u\W*e\W*e|(?:\W*l\W*i\W*n\W*k|\W*o\W*b\W*j\W*e\W*c\W*t|\W*e\W*m\W*b\W*e\W*d|\W*a\W*p\W*p\W*l\W*e\W*t|\W*p\W*a\W*r\W*a\W*m|\W*i?\W*f\W*r\W*a\W*m\W*e|\W*b\W*a\W*s\W*e|\W*b\W*o\W*d\W*y|\W*m\W*e\W*t\W*a|\W*i\W*m\W*a?\W*g\W*e?|\W*v\W*i\W*d\W*e\W*o|\W*a\W*u\W*d\W*i\W*o|\W*b\W*i\W*n\W*d\W*i\W*n\W*g\W*s|\W*s\W*e\W*t|\W*i\W*s\W*i\W*n\W*d\W*e\W*x|\W*a\W*n\W*i\W*m\W*a\W*t\W*e)[^>\w])|(?:<\w[\s\S]*[\s\0\/]|['"])(?:formaction|style|background|src|lowsrc|ping|on(?:d(?:e(?:vice(?:(?:orienta|mo)tion|proximity|found|light)|livery(?:success|error)|activate)|r(?:ag(?:e(?:n(?:ter|d)|xit)|(?:gestur|leav)e|start|drop|over)?|op)|i(?:s(?:c(?:hargingtimechange|onnect(?:ing|ed))|abled)|aling)|ata(?:setc(?:omplete|hanged)|(?:availabl|chang)e|error)|urationchange|ownloading|blclick)|Moz(?:M(?:agnifyGesture(?:Update|Start)?|ouse(?:PixelScroll|Hittest))|S(?:wipeGesture(?:Update|Start|End)?|crolledAreaChanged)|(?:(?:Press)?TapGestur|BeforeResiz)e|EdgeUI(?:C(?:omplet|ancel)|Start)ed|RotateGesture(?:Update|Start)?|A(?:udioAvailable|fterPaint))|c(?:o(?:m(?:p(?:osition(?:update|start|end)|lete)|mand(?:update)?)|n(?:t(?:rolselect|extmenu)|nect(?:ing|ed))|py)|a(?:(?:llschang|ch)ed|nplay(?:through)?|rdstatechange)|h(?:(?:arging(?:time)?ch)?ange|ecking)|(?:fstate|ell)change|u(?:echange|t)|l(?:ick|ose))|m(?:o(?:z(?:pointerlock(?:change|error)|(?:orientation|time)change|fullscreen(?:change|error)|network(?:down|up)load)|use(?:(?:lea|mo)ve|o(?:ver|ut)|enter|wheel|down|up)|ve(?:start|end)?)|essage|ark)|s(?:t(?:a(?:t(?:uschanged|echange)|lled|rt)|k(?:sessione|comma)nd|op)|e(?:ek(?:complete|ing|ed)|(?:lec(?:tstar)?)?t|n(?:ding|t))|u(?:ccess|spend|bmit)|peech(?:start|end)|ound(?:start|end)|croll|how)|b(?:e(?:for(?:e(?:(?:scriptexecu|activa)te|u(?:nload|pdate)|p(?:aste|rint)|c(?:opy|ut)|editfocus)|deactivate)|gin(?:Event)?)|oun(?:dary|ce)|l(?:ocked|ur)|roadcast|usy)|a(?:n(?:imation(?
:iteration|start|end)|tennastatechange)|fter(?:(?:scriptexecu|upda)te|print)|udio(?:process|start|end)|d(?:apteradded|dtrack)|ctivate|lerting|bort)|DOM(?:Node(?:Inserted(?:IntoDocument)?|Removed(?:FromDocument)?)|(?:CharacterData|Subtree)Modified|A(?:ttrModified|ctivate)|Focus(?:Out|In)|MouseScroll)|r(?:e(?:s(?:u(?:m(?:ing|e)|lt)|ize|et)|adystatechange|pea(?:tEven)?t|movetrack|trieving|ceived)|ow(?:s(?:inserted|delete)|e(?:nter|xit))|atechange)|p(?:op(?:up(?:hid(?:den|ing)|show(?:ing|n))|state)|a(?:ge(?:hide|show)|(?:st|us)e|int)|ro(?:pertychange|gress)|lay(?:ing)?)|t(?:ouch(?:(?:lea|mo)ve|en(?:ter|d)|cancel|start)|ime(?:update|out)|ransitionend|ext)|u(?:s(?:erproximity|sdreceived)|p(?:gradeneeded|dateready)|n(?:derflow|load))|f(?:o(?:rm(?:change|input)|cus(?:out|in)?)|i(?:lterchange|nish)|ailed)|l(?:o(?:ad(?:e(?:d(?:meta)?data|nd)|start)?|secapture)|evelchange|y)|g(?:amepad(?:(?:dis)?connected|button(?:down|up)|axismove)|et)|e(?:n(?:d(?:Event|ed)?|abled|ter)|rror(?:update)?|mptied|xit)|i(?:cc(?:cardlockerror|infochange)|n(?:coming|valid|put))|o(?:(?:(?:ff|n)lin|bsolet)e|verflow(?:changed)?|pen)|SVG(?:(?:Unl|L)oad|Resize|Scroll|Abort|Error|Zoom)|h(?:e(?:adphoneschange|l[dp])|ashchange|olding)|v(?:o(?:lum|ic)e|ersion)change|w(?:a(?:it|rn)ing|heel)|key(?:press|down|up)|(?:AppComman|Loa)d|no(?:update|match)|Request|zoom))[\s\0]*=

Test: http://regex101.com/r/rV7zK8

I think it blocks 99% of XSS because it is part of NoScript, an add-on that gets updated regularly.

^(\s|\w|\d|<br>)*?$ 

This validates word characters (letters, digits, and underscores), whitespace, and the <br> tag. If you are willing to accept more risk, you can allow more tags, for example:

^(\s|\w|\d|<br>|<ul>|</ul>)*?$
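In Java this check could look roughly like the sketch below. It assumes the closing tag in the second pattern is meant to be `</ul>`, and it folds `\s|\w|\d` into the character class `[\s\w]`, which accepts the same strings (`\w` already covers digits) without overlapping alternatives. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class AllowListValidator {
    // Accepts the same strings as ^(\s|\w|\d|<br>|<ul>|</ul>)*?$ when used with matches().
    static final Pattern SAFE = Pattern.compile("(?:[\\s\\w]|<br>|<ul>|</ul>)*");

    /** Returns true only if the whole input consists of allowed characters and tags. */
    static boolean isSafe(String input) {
        return SAFE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSafe("hello<br>world 42"));          // true
        System.out.println(isSafe("<script>alert(1)</script>"));  // false
    }
}
```

Unlike a search-and-replace filter, `matches()` must consume the entire input, so anything outside the allowed set causes a rejection instead of slipping through.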

The biggest problem with using Jeff's code is the @ (C#'s verbatim string prefix), which isn't available in Java.

I would probably just take the "raw" regexp from Jeff's code, if I needed it, and paste it into

http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html

to see everything that needs escaping get escaped, and then use the result.
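The escaping itself is mechanical. In a C# verbatim string (`@"..."`) backslashes are literal and a double quote is written as `""`; in a Java string literal every backslash has to be doubled and every quote written as `\"`. The sketch below shows the conversion on a short hypothetical pattern, not on Jeff's actual regex.

```java
import java.util.regex.Pattern;

class VerbatimToJava {
    // C# verbatim literal:  @"<a\shref=""[^""]*"">"
    // Java literal:         double each backslash, write each quote as \" instead of "".
    static final Pattern TAG = Pattern.compile("<a\\shref=\"[^\"]*\">");

    static boolean matches(String s) {
        return TAG.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The compiled pattern contains single backslashes again.
        System.out.println(TAG.pattern());
        System.out.println(matches("<a href=\"#1\">"));
    }
}
```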


Given how this regex will be used, I would personally make sure I understood exactly what I was doing, why, and what the consequences would be if I didn't succeed, before copying and pasting anything, which is what the other answers try to help you with.

(That's probably pretty sound advice for any copy/paste.)

[\s\w\.]*. If it doesn't match, you've got XSS. Maybe. Note that this expression only allows letters, digits, underscores, whitespace, and periods. It avoids all other symbols, even useful ones, out of fear of XSS. Once you allow &, you've got worries, and merely replacing all instances of & with &amp; is not sufficient. Too complicated to trust :P. Obviously this will disallow a lot of legitimate text (you can just replace all nonmatching characters with a ! or something), but I think it will kill XSS.

The idea to just parse it as html and generate new html is probably better.
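For what it's worth, the strict character filter above, including the suggestion to replace nonmatching characters with a `!`, could be sketched in Java as follows. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class StrictCharFilter {
    static final Pattern ALLOWED = Pattern.compile("[\\s\\w.]*");      // whole input must match
    static final Pattern DISALLOWED = Pattern.compile("[^\\s\\w.]");   // any single other character

    /** True if the input contains only whitespace, word characters, and periods. */
    static boolean isClean(String input) {
        return ALLOWED.matcher(input).matches();
    }

    /** Replaces every character outside the allowed set with '!'. */
    static String neutralize(String input) {
        return DISALLOWED.matcher(input).replaceAll("!");
    }

    public static void main(String[] args) {
        System.out.println(isClean("Hello world. 42"));                // true
        System.out.println(neutralize("<script>alert(1)</script>"));   // !script!alert!1!!!script!
    }
}
```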

An old thread, but maybe this will be useful for other users. There is a maintained security layer tool for PHP: https://github.com/PHPIDS/ It is based on a set of regexes, which you can find here:

https://github.com/PHPIDS/PHPIDS/blob/master/lib/IDS/default_filter.xml

This question perfectly illustrates a great application of the Theory of Computation, a field that focuses on producing mathematical representations of computers.

Some of the most profound results in computation theory are the proofs that establish the relationships between various classes of languages.

Some of the language relationships that computation theorists have proven include:

[Image not preserved: diagram of the relationships between classes of formal languages]

This shows that context-free languages are more powerful than regular languages. Thus, if a language is strictly context-free (context-free but not regular), it is impossible for any regular expression to recognize it.

JavaScript is at the very least context-free, so we know with one hundred percent certainty that designing a regular expression (regex) capable of catching all XSS is an impossible task.
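A classic example of a language that is context-free but not regular is the set of strings with balanced brackets. Recognizing it requires tracking an unbounded nesting depth, which a finite automaton (and therefore a regular expression in the formal sense) cannot do, while a few lines of ordinary code with a counter can. The sketch below is only an illustration of that gap; the class name is hypothetical.

```java
class BalancedBraces {
    /** Returns true if every '{' has a matching '}' and no '}' comes too early. */
    static boolean isBalanced(String s) {
        int depth = 0;                       // unbounded counter: the part a finite automaton lacks
        for (char c : s.toCharArray()) {
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth < 0) {
                return false;                // closing brace with nothing open
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("{ a { b } }"));  // true
        System.out.println(isBalanced("{ a { b }"));    // false
    }
}
```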

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow