Question

I ran into this error using Rails' HTML::FullSanitizer in the Rails console:

h = HTML::FullSanitizer.new
html = "Something with invalid characters \x80 and tags ī."
h.sanitize html

ArgumentError: invalid byte sequence in UTF-8
from /Users/benaluan/.rbenv/versions/1.9.3-p385/lib/ruby/gems/1.9.1/gems/actionpack-3.2.12/lib/action_controller/vendor/html-scanner/html/sanitizer.rb:37:in `sanitize'

I tried encoding the HTML before sanitizing:

html = html.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

It works; however, it also removes the ī character. Has anyone experienced the same issue?
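For anyone wondering why the ī disappears: a minimal sketch, assuming a UTF-8 source file and reusing the sample string from above. Declaring the source as 'binary' makes every byte at or above 0x80 undefined for the conversion, so the two bytes of ī are dropped together with the stray \x80. The variable name `stripped` is only for illustration.

```ruby
html = "Something with invalid characters \x80 and tags ī."

# The literal carries a lone \x80 byte, so it is not valid UTF-8.
puts html.valid_encoding?   # => false

# In UTF-8, "ī" (U+012B) is two bytes, both >= 0x80.
puts "ī".bytes.map { |b| format('%02X', b) }.join(' ')   # => C4 AB

# With 'binary' as the source encoding, every byte >= 0x80 has no UTF-8
# mapping, so undef: :replace strips \x80 *and* both bytes of "ī".
stripped = html.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
puts stripped
```

So the call is not misbehaving; it is being told that the input has no multibyte characters at all.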


Solution

This article describes exactly your problem: http://www.spacevatican.org/2012/7/7/stripping-invalid-utf-8/

The solution code from the article:

html = html.force_encoding('UTF-8').
      encode('UTF-16', :invalid => :replace, :replace => '').
      encode('UTF-8')
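
To check the round trip against the sample string from the question, here is a small sketch. It also shows String#scrub as a possible alternative; that part is not from the article and assumes Ruby 2.1 or newer, so it would not apply to the 1.9.3 install in the question. The names `round_trip` and `scrubbed` are only for illustration, and `dup` is used because force_encoding modifies the string in place.

```ruby
html = "Something with invalid characters \x80 and tags ī."

# Round trip from the answer: treat the bytes as UTF-8, drop whatever
# cannot be transcoded to UTF-16, then transcode back to UTF-8.
round_trip = html.dup.force_encoding('UTF-8').
  encode('UTF-16', invalid: :replace, replace: '').
  encode('UTF-8')

# Assumption: Ruby >= 2.1. scrub replaces invalid byte sequences with the
# given string (here: nothing) and leaves valid characters untouched.
scrubbed = html.scrub('')

puts round_trip
puts scrubbed
puts round_trip.valid_encoding?
```

In both cases only the invalid \x80 byte should be removed, while ī is kept because it is a valid UTF-8 sequence.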
Licensed under: CC-BY-SA with attribution