Question

I am processing very large XML files, so I need to use a SAX/evented XML parser. Nokogiri::XML::SAX seemed like an obvious choice. However, the SAX parser seems to choke on small errors, even ones the regular XML parser has no trouble recovering from.

In the example below, the url attribute of <property> has an & that should really be escaped to &amp;. Nokogiri::XML is still able to parse the elements within <property> but Nokogiri::XML::SAX just seems to give up and never triggers events for the elements within <property>.

require 'nokogiri'

class Doc < Nokogiri::XML::SAX::Document
  include Enumerable

  def initialize(xml)
    @xml = xml
  end

  def each(&block)
    @on_record = block
    parse(@xml)
  end

  def parse(xml)
    parser = Nokogiri::XML::SAX::Parser.new(self)
    parser.parse(xml)
  end

  def end_element(name)
    @on_record.call(name) if name == "details"
  end

  def error(str)
    puts str
  end
end

xml = <<XML
<?xml version="1.0" encoding="UTF-8"?>
<streeteasy version="1.5">
  <properties>
    <property url="http://example.com/?foo=bar&yin=yang">
      <location>Somewhere</location>
      <details>Information goes here</details>
    </property>
  </properties>
</streeteasy>
XML

puts Doc.new(xml).count # => 0, but should be 1
puts Nokogiri::XML(xml).xpath("//details").count # => 1

The script above should output:

1
1

However, I get:

EntityRef: expecting ';'
0
1

Is there a way to make Nokogiri ignore these small errors? Is there a better option for SAX/push/pull/evented XML parsing in Ruby that would ignore errors like these?

Solution 2

The SAX parser behaves a little differently from the regular parser, but you can set it to recover from errors by enabling recovery on the parser context in the block passed to parse. You can also use the error handler method to deal with specific errors.

class MyDoc < Nokogiri::XML::SAX::Document
  def error(error)
    puts "An error occurred: #{error}"
  end

  def start_element(name, attributes = [])
    puts "found a #{name}"
  end
end

require 'open-uri'

# url is a placeholder for the location of the document to parse.
parser = Nokogiri::HTML::SAX::Parser.new(MyDoc.new)
parser.parse(URI.open(url)) do |ctx|
  ctx.recovery = true
end

OTHER TIPS

Use Nokogiri's HTML SAX parser instead.

Change this line

parser = Nokogiri::XML::SAX::Parser.new(self)

to this line

parser = Nokogiri::HTML::SAX::Parser.new(self)

The HTML parser apparently runs libxml in recovery mode and is able to recover from errors. This allows the example to output the desired 1/1, albeit with some whining about the non-standard "html" tags.

Tag streeteasy invalid
Tag properties invalid
htmlParseEntityRef: expecting ';'
Tag property invalid
Tag location invalid
Tag details invalid
1
1

Update

It turns out this works for my contrived example, but as soon as Nokogiri::HTML::SAX::Parser#parse is passed an IO instead of a String it chokes on errors just like the XML version. I can't load the file into memory... that defeats the whole purpose of using a SAX parser. So, not accepting my own answer.

Licensed under: CC-BY-SA with attribution