Question

I am working on finding a good way to make user submitted data, in this case allow HTML and have it be as safe and fast as I can.

I know EVERY SINGLE PERSON on this site seems to think http://htmlpurifier.org is the answer here. I do agree partially. htmlpurifier has the best open source code out there for filtering user submitted HTML but there solution is very bulky and is not good for performance on a high traffic site. I might even use there solution someday but for now my goal is to find a more lightweight method.

I have been using the 2 functions below for about 2 and a half years now with no problems yet but I think it is time to take some input from the pro's on here if they will help me.

The first function is called FilterHTML($string) it is ran before user data is saved to a mysql database. The second function is called format_db_value($text, $nl2br = false) and I use it on a page where I plan to show the user submitted data.

Below the 2 functions is a bunch of the XSS codes I found on http://ha.ckers.org/xss.html and I then ran them on these 2 functions to see how affective my code is, I am somewhat pleased with the results, they did block out every code I tried but I know it is still not 100% safe obviously.

Can you guys please look over it and give me any advice for my code itself or even on the whole html filtering concept.

I would like to do a whitelist approach someday but htmlpurifier is the only solution I have found worth using for that and as I mentioned it is not lightweight as I would like.

function FilterHTML($string) {
    if (get_magic_quotes_gpc()) {
        $string = stripslashes($string);
    }
    $string = html_entity_decode($string, ENT_QUOTES, "ISO-8859-1");
    // convert decimal
    $string = preg_replace('/&#(\d+)/me', "chr(\\1)", $string); // decimal notation
    // convert hex
    $string = preg_replace('/&#x([a-f0-9]+)/mei', "chr(0x\\1)", $string); // hex notation
    //$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
    $string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#U', "$1;", $string);
    $string = preg_replace('#(<[^>]+[\s\r\n\"\'])(on|xmlns)[^>]*>#iU', "$1>", $string);
    //$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string); //bad line
    $string = preg_replace('#/*\*()[^>]*\*/#i', "", $string); // REMOVE /**/
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\`\'\"]*)[\\x00-\x20]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //JAVASCRIPT
    $string = preg_replace('#([a-z]*)([\'\"]*)[\x00-\x20]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iU', '...', $string); //VBSCRIPT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*([\\\]*)[\\x00-\x20]*@([\\\]*)[\x00-\x20]*i([\\\]*)[\x00-\x20]*m([\\\]*)[\x00-\x20]*p([\\\]*)[\x00-\x20]*o([\\\]*)[\x00-\x20]*r([\\\]*)[\x00-\x20]*t#iU', '...', $string); //@IMPORT
    $string = preg_replace('#([a-z]*)[\x00-\x20]*e[\x00-\x20]*x[\x00-\x20]*p[\x00-\x20]*r[\x00-\x20]*e[\x00-\x20]*s[\x00-\x20]*s[\x00-\x20]*i[\x00-\x20]*o[\x00-\x20]*n#iU', '...', $string); //EXPRESSION
    $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
    $string = preg_replace('#</?t(able|r|d)(\s[^>]*)?>#i', '', $string); // strip out tables
    $string = preg_replace('/(potspace|pot space|rateuser|marquee)/i', '...', $string); // filter some words
    //$string = str_replace('left:0px; top: 0px;','',$string);
    do {
        $oldstring = $string;
        //bgsound|
        $string = preg_replace('#</*(applet|meta|xml|blink|link|script|iframe|frame|frameset|ilayer|layer|title|base|body|xml|AllowScriptAccess|big)[^>]*>#i', "...", $string);
    } while ($oldstring != $string);
    return addslashes($string);
}

Below function is used when showing user submitted code on a webpage

function format_db_value($text, $nl2br = false) {
    if (is_array($text)) {
        $tmp_array = array();
        foreach ($text as $key => $value) {
            $tmp_array[$key] = format_db_value($value);
        }
        return $tmp_array;
    } else {
        $text = htmlspecialchars(stripslashes($text));
        if ($nl2br) {
            return nl2br($text);
        } else {
            return $text;
        }
    }
}

The codes below are from ha.ckers.org and they all seem to fail on my functions above

I did not try everyone on that site though there is a lot more, this is just some of them.
The original code is on the top line of each set and the code after running through my functions is on the line below it.

<IMG SRC="javascript:alert(\'XSS\');"><b>hello</b> hiii
<IMG SRC=...alert('XSS');"><b>hello</b> hiii

<IMG SRC=JaVaScRiPt:alert('XSS')>
<IMG SRC=...alert('XSS')>

<IMG SRC=javascript:alert(String.fromCharCode(88,83,83))>
<IMG SRC=...alert(String.fromCharCode(88,83,83))>

<IMG SRC=&#106;&#97;&#118;&#97;&#115;&#99;&#114;&#105;&#112;&#116;&#58;&#97;&#108;&#101;&#114;&#116;&#40;&#39;&#88;&#83;&#83;&#39;&#41;>
<IMG SRC=...alert('XSS')>

<IMG SRC=&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041>
<IMG SRC=F  MLEJNALN !>

<IMG SRC=&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29>
<IMG SRC=...alert('XSS')>


<IMG SRC="jav&#x0A;ascript:alert('XSS');">
<IMG SRC=...alert('XSS');">

perl -e 'print "<IMG SRC=javascript:alert("XSS")>";' > out
perl -e 'print "<IMG SRC=java\0script:alert(\"XSS\")>";' > out

<BODY onload!#$%&()*~+-_.,:;?@[/|\]^`=alert("XSS")>
...

<iframe src=http://ha.ckers.org/scriptlet.html <
...

<LAYER SRC="http://ha.ckers.org/scriptlet.html"></LAYER>
......

<META HTTP-EQUIV="Link" Content="<http://ha.ckers.org/xss.css>; REL=stylesheet">
...; REL=stylesheet">

<IMG STYLE="xss:...(alert('XSS'))">
<IMG STYLE="xss:expr/*XSS*/ession(alert('XSS'))">

<XSS STYLE="xss:...(alert('XSS'))">
<XSS STYLE="xss:expression(alert('XSS'))">

<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>

<EMBED SRC="data:image/svg+xml;base64,PHN2ZyB4bWxuczpzdmc9Imh0dH A6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcv MjAwMC9zdmciIHhtbG5zOnhsaW5rPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hs aW5rIiB2ZXJzaW9uPSIxLjAiIHg9IjAiIHk9IjAiIHdpZHRoPSIxOTQiIGhlaWdodD0iMjAw IiBpZD0ieHNzIj48c2NyaXB0IHR5cGU9InRleHQvZWNtYXNjcmlwdCI+YWxlcnQoIlh TUyIpOzwvc2NyaXB0Pjwvc3ZnPg==" type="image/svg+xml" AllowScriptAccess="always"></EMBED>


<IMG
SRC
=
"
j
a
v
a
s
c
r
i
p
t
:
a
l
e
r
t
(
'
X
S
S
'
)
"
>

<IMG
SRC
=...
a
l
e
r
t
(
'
X
S
S
'
)
"
>
Was it helpful?

Solution

Here is four alternatives :

OTHER TIPS

Only way to be sure is to whitelist the tags and attributes that they can use and write strict regexps to validate allowed values of attributes. If you want to allow attributes such as "style" then you have additional complexity.

Blacklisting only might make attack for some people harder but it will not make it any harder for the person that uses technique you have not heard of yet.

I'd try using regexp to add missing closing tags to what users entered and replace <br> with <br /> and so on, then parse it using SimpleXML, then iterate over it and remove any tag that is not in whitelist, any attribute that is not in the whitelist for given tag, and any attribute that has a value that does conform to precise regexp for this attribute. After all I'd use asXML() to get the text back. I'd start with minimal set of tags and attributes and add new ones as needed being especially careful of anything that may contain url.

IMHO htmlawed is the best -- lean, fast, full HTML coverage, most flexible... black OR white list for tags AND attributes. Safe? Defeats all the ha.ckers XSS codes

How about using PHP's native HTML parser?

I was curious about it, so I've wrote some code for testing (requires PHP 5.3.6+):

$badHtml = file_get_contents('badHtml.txt');
$html = sprintf('<div id="input">%s</div>', $badHtml);

// tidy is no required, but may fix invalid markup
$tidy = new \tidy();
$tidy->parseString($html, array(), 'utf8');
$tidy->cleanRepair();

$dom = new \DomDocument('1.0', 'UTF-8');
libxml_use_internal_errors(true);
$dom->loadHtml($tidy);
$input = $dom->getElementById('input');

// tag as key, attributes as values
$allowed = array(
  'table'  => array('border'),  
  'tbody'  => array(),
  'tr'     => array(),
  'td'     => array(),
  'th'     => array(),
  'img'    => array('src', 'alt'),
  'p'      => array(),
  'ul'     => array(),
  'ol'     => array(),
  'li'     => array(),
  'a'      => array('href', 'title'),
  'strong' => array(),
  'em'     => array(),
  'sub'    => array(),
  'sup'    => array(),
);

$walk = function(\DomNode $node) use($allowed, &$walk){

  // only check tags
  if($node->nodeType !== XML_ELEMENT_NODE)
    return;

  if(!isset($allowed[$node->nodeName]))
    return $node->parentNode->removeChild($node);

  foreach($node->attributes as $key => $attr){
    if(!in_array($key, $allowed[$node->nodeName], true))
     $node->removeAttribute($key);

    // expect URLs here
    if(!in_array($key, array('href', 'src'), true))
      continue;

    if(!filter_var($attr->value, FILTER_VALIDATE_URL))
      return $node->parentNode->removeChild($node); 

  }

  array_map($walk, iterator_to_array($node->childNodes));  
};

// convert DOMNodeList to array because this way the bad stuff
// can be removed within the loop
array_map($walk, iterator_to_array($input->childNodes));

// export HTML
$sanitized = $dom->saveHtml($input);

The output, without running Tidy:

enter image description here

Seems ok. Or did it remove too much? :) Should be way faster than HTMLPurifier, theoretically more secure since it's less permissive, and probably faster than the regexes too.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top