What is the best way to handle user generated html content that will be viewed by the public?

StackOverflow https://stackoverflow.com/questions/1608758

Question

In my web application I allow user generated content to be posted for public consumption similar to Stackoverflow.

What is the best practice for handling this?

My current steps for handling user generated content are:

  1. I use MarkItUp to allow users an easy way to format their HTML.

  2. After a user has submitted their changes, I run it through an HTML sanitizer (scroll to the bottom) that uses a white-list approach.

  3. If the sanitization process has removed any user-created content, I do not save the content. I return their modified content with the warning message, "Some illegal content tags were detected and removed; double-check your work and try again."

  4. If the content passes through the sanitization process cleanly, I save the raw HTML content to the database.

  5. When rendering to the client, I just pass the raw HTML from the database out to the page.
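
For reference, here is a minimal sketch of what the white-list pass in step 2 can look like. This is an illustrative assumption, not the sanitizer linked above: it uses the browser's DOMParser, and the ALLOWED_TAGS/ALLOWED_ATTRS lists are examples to adjust.

// Hypothetical sketch of a white-list sanitizer, not the one linked above.
// Example white list -- adjust to taste.
const ALLOWED_TAGS = new Set(['p', 'b', 'i', 'em', 'strong', 'ul', 'ol',
                              'li', 'pre', 'code', 'blockquote', 'a']);
const ALLOWED_ATTRS = { a: new Set(['href']) };

function sanitize(html) {
  const doc = new DOMParser().parseFromString(html, 'text/html');
  let removed = false;

  function walk(node) {
    for (const child of Array.from(node.children)) {
      const tag = child.tagName.toLowerCase();
      if (!ALLOWED_TAGS.has(tag)) {
        // Not on the white list: drop the element and everything inside it.
        removed = true;
        child.remove();
        continue;
      }
      // Strip every attribute that isn't explicitly allowed for this tag.
      // NB: real code should also validate attribute values, e.g. reject
      // javascript: URLs in href.
      const allowed = ALLOWED_ATTRS[tag] || new Set();
      for (const attr of Array.from(child.attributes)) {
        if (!allowed.has(attr.name)) {
          removed = true;
          child.removeAttribute(attr.name);
        }
      }
      walk(child);
    }
  }

  walk(doc.body);
  // `removed` lets the caller reject the submission with a warning (step 3).
  return { html: doc.body.innerHTML, removed };
}

A caller would then persist result.html only when result.removed is false, mirroring steps 3 and 4.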


Solution

That's an entirely reasonable approach, and for typical applications it will be sufficient.

The trickiest parts of white-listing raw HTML are the style attribute and embed/object. There are legitimate reasons why someone might want to put CSS styles into an otherwise untrusted block of formatted text, or, say, an embedded YouTube video. This issue comes up most commonly with feeds: you can't trust the arbitrary block of text contained within a feed entry, but you don't want to strip out, e.g., syntax-highlighting CSS or Flash video, because that would fundamentally change the content and potentially confuse anyone reading it. Because CSS can contain dangerous things like behaviors in IE, you may have to parse the CSS if you decide to let the style attribute stay in. And with embed/object you may need to white-list hostnames.
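
If you do keep embed/object, host white-listing can be as simple as parsing the src URL and comparing its hostname against a fixed set. A hypothetical sketch (the host list and function name are examples, not any particular library's API):

// Hypothetical host white list for embeds -- the entries are examples.
const ALLOWED_EMBED_HOSTS = new Set(['www.youtube.com', 'player.vimeo.com']);

function embedAllowed(src) {
  try {
    const url = new URL(src);
    // Require https and an exact hostname match; substring checks are spoofable.
    return url.protocol === 'https:' && ALLOWED_EMBED_HOSTS.has(url.hostname);
  } catch (e) {
    return false; // unparseable src: reject it
  }
}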

Addenda:

In worst-case scenarios, HTML-escaping everything in sight can lead to a very poor user experience. It's much better to use something like one of the HTML5 parsers to go through the DOM with your white list. This is much more flexible in terms of how you present the sanitized output to your users. You can even do things like:

<div class="sanitized">
  <div class="notice">
    This was sanitized for security reasons.
  </div>
  <div class="raw"><pre>
    &lt;script&gt;alert("XSS!");&lt;/script&gt;
  </pre></div>
</div>

Then hide the .raw stuff with CSS, and use jQuery to bind a click handler to the .sanitized div that toggles between .raw and .notice:

CSS:

.raw {
  display: none;
}

jQuery:

$('.sanitized').click(function() {
  // Swap the notice and the escaped raw markup on each click.
  $(this).find('.notice').toggle();
  $(this).find('.raw').toggle();
});

OTHER TIPS

The white list is a good move. Any black-list solution is prone to letting through more than it should, because you just can't think of everything. I've seen some attempts at using black lists (for example, The Code Project), and even when they manage to catch everything, they generally still cause additional problems, like replacing characters in code so that it can't be used without manually restoring it first.

The safest method would be:

  1. HTML encode all the text.

  2. Match a set of allowed tags and attributes and decode those.

Using a regular expression you can even require that each opening tag has a closing tag, so that an unclosed tag can't mess up the page.

You should be able to do this in something like ten lines of code, so the code that you linked to seems overly complicated.
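
For illustration, a rough sketch of that encode-then-decode idea; the tag list and the regular expression are assumptions, not the exact ten lines this answer has in mind:

// Hypothetical sketch: escape everything, then un-escape only matched
// pairs of simple, attribute-free tags.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function sanitizeByEncoding(text) {
  let html = escapeHtml(text);
  // Because the pattern requires both the opening and the closing tag,
  // an unclosed tag stays escaped and can't break the page.
  for (const tag of ['b', 'i', 'em', 'strong', 'code']) {
    const re = new RegExp('&lt;' + tag + '&gt;([\\s\\S]*?)&lt;/' + tag + '&gt;', 'g');
    html = html.replace(re, '<' + tag + '>$1</' + tag + '>');
  }
  return html;
}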

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow