Sanitize gem (and Loofah) removing text before leading colon inside tags

https://stackoverflow.com/questions/13409801

29-11-2021

Question

I ran into some strange behavior with both Loofah and Sanitize while cleaning up some HTML fragments: times like "6:30pm" were turning into "30pm".

I did some investigation and found the following:

Loofah.scrub_fragment("<span>asdfasdf 6:30 pm</span>", :strip).to_html
#=> "<span>asdfasdf 30 pm</span>"
Loofah.scrub_fragment("6:30 pm", :strip).to_html
#=> "6:30 pm"
Loofah.scrub_fragment("<foo>asdfasdf 6&#58;30 pm</foo>", :strip).to_html
#=> "asdfasdf 6:30 pm"
Loofah.scrub_fragment("bar:30 pm", :strip).to_html
#=> "bar:30 pm"
Loofah.scrub_fragment("<span>bar:30 pm</span>", :strip).to_html
#=> "<span>30 pm</span>"
Loofah.scrub_fragment("<span>bar: asdfasdfadsf pm</span>", :strip).to_html
#=> "<span>bar: asdfasdfadsf pm</span>"

This is the case with all the variants of Loofah (:prune etc) and of Sanitize, so I'm assuming it's a matter of code common to both of them. Is there anything special I need to be doing to escape colons in the code before sanitizing?

Edit 1: I realize I neglected to mention that I'm using JRuby (jruby 1.7.0 (1.9.3p203)). I'm trying to sort out whether there may be an issue in Nokogiri (which underlies both of these gems?).

Edit 2: With some further digging, it looks like this MIGHT be an issue in Nokogiri on JRuby (I'm on version 1.5.5 of Nokogiri, for what that's worth). I tried Nokogiri's fragment parser on JRuby and on Ruby 1.9.3:

JRuby 1.7.0: Unexpected results

doc = Nokogiri::HTML.fragment("<span>3:30pm</span>")
=> #(DocumentFragment:0x5fbc {
  name = "#document-fragment",
  children = [
    #(Element:0x5fc0 { name = "span", children = [ #(Text "30pm")] })]
  })

Ruby 1.9.3: Expected results

doc = Nokogiri::HTML.fragment("<span>3:30pm</span>")
=> #(DocumentFragment:0x3fc4b102055c {
  name = "#document-fragment",
  children = [
    #(Element:0x3fc4b101fff8 {
      name = "span",
      children = [ #(Text "3:30pm")]
      })]
  })

Will try to keep digging but any suggestions are welcome.

Solution

I believe this is a regression in Nokogiri. I was able to reproduce your problem and tried it with several versions of Nokogiri.

It works properly in 1.5.0:

jruby-1.6.7.2 :002 > gem 'nokogiri', '=1.5.0'
 => true 
jruby-1.6.7.2 :003 > require 'nokogiri'
 => true 
jruby-1.6.7.2 :004 > doc = Nokogiri::HTML.fragment("<span>3:30pm</span>")
 => #<Nokogiri::HTML::DocumentFragment:0x7d4 name="#document-fragment" children=[#<Nokogiri::XML::Element:0x7d2 name="span" children=[#<Nokogiri::XML::Text:0x7d0 "3:30pm">]>]>

It fails in 1.5.1:

jruby-1.6.7.2 :002 > gem 'nokogiri', '=1.5.1'
 => true 
jruby-1.6.7.2 :003 > require 'nokogiri'
 => true 
jruby-1.6.7.2 :004 > doc = Nokogiri::HTML.fragment("<span>3:30pm</span>")
 => #<Nokogiri::HTML::DocumentFragment:0x7d4 name="#document-fragment" children=[#<Nokogiri::XML::Element:0x7d2 name="span" children=[#<Nokogiri::XML::Text:0x7d0 "30pm">]>]>

Edit: It's important to note that Nokogiri was built around the awesome libxml2 C library, which is really unmatched in features, speed, and ability to handle malformed markup. The JRuby implementation is an attempt to match it using Xerces and NekoHTML. I think they have done a wonderful job making the JRuby implementation almost completely match the functionality (if not the speed) of its MRI counterpart, papering over the differences between two vastly different implementations. That being said, there are still edge cases that crop up from time to time.

I went ahead and filed a bug report on Nokogiri.
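
Until that is fixed, one possible workaround follows from the question's own observation that "6&#58;30" comes through intact: replace literal colons in the text with the numeric character reference &#58; before handing the fragment to Loofah or Sanitize. The sketch below is only an illustration. `escape_text_colons` is a hypothetical helper name, not part of any of these gems, and the regex is a rough way to skip over tags rather than a real HTML parser (it will misjudge attribute values containing ">", comments, and script or style content). I have only the one example from the question as evidence that the entity form survives on the affected versions.

```ruby
# Hypothetical helper (not part of Loofah, Sanitize, or Nokogiri).
# Replaces each literal colon found outside of <...> tags with the
# numeric character reference &#58;, leaving the tags themselves
# (including URLs in attributes) unchanged.
def escape_text_colons(html)
  # The pattern matches either a whole tag or a single colon.
  # Tags are emitted as-is; bare colons are replaced.
  html.gsub(/<[^>]*>|:/) { |match| match == ':' ? '&#58;' : match }
end

puts escape_text_colons('<span>asdfasdf 6:30 pm</span>')
# prints: <span>asdfasdf 6&#58;30 pm</span>
```

The escaped string would then be passed to the sanitizer in place of the original, e.g. Loofah.scrub_fragment(escape_text_colons(html), :strip).to_html.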

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow