Question

I was trying to download an xml file that has '&' symbols in it using the HTTParty gem and I am getting this error:

"treeparser.rb:95:in `rescue in parse' <RuntimeError: Illegal character '&' 
 in raw string  "4860 BOOMM 10x20 MD&"> (MultiXml::ParseError)"

Here is my code:

class SAPOrders
  include HTTParty
  default_params :output => 'xml'
  format :xml
  base_uri '<webservice url>'
end

xml =  SAPOrders.get('/<nameOfFile.xml>').inspect

What am I missing?

Solution

If you are using HTTParty and it parses the incoming XML before you can get your hands on it, you'll need to split that process into two steps, the get and the parse, so you can put code between them.

I use OpenURI and Nokogiri for just those reasons, but whether you use those two or their equivalents, you will have the opportunity to pre-process the XML before parsing it. A bare '&' is illegal in XML; it should be encoded as '&amp;' or placed in a CDATA block. Unfortunately, in the wilds of the internet there are lots of malformed XML feeds and files.
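
As a rough sketch of that pre-processing step, something like the following could escape bare ampersands before the string reaches any parser. The constant and helper names are made up for illustration, and the pattern assumes the feed has no CDATA sections or comments containing '&', since those would be rewritten too:

```ruby
# Matches an '&' that does NOT start a well-formed entity or character
# reference such as &amp;, &#38; or &#x26;.
BARE_AMPERSAND = /&(?!(?:[A-Za-z_][\w.-]*|#\d+|#x\h+);)/

# Hypothetical helper: escape only the bare ampersands in a raw XML string,
# leaving references that are already encoded untouched.
def escape_bare_ampersands(raw_xml)
  raw_xml.gsub(BARE_AMPERSAND, '&amp;')
end

raw = '<b parm="4860 BOOMM 10x20 MD&">R&amp;D &#38; AT&T</b>'
puts escape_bare_ampersands(raw)
```

Because already-encoded references are skipped, running the helper twice on the same string should not double-escape anything.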

The thing I like about Nokogiri for this task is it keeps on chugging, at least as far as it can. You can look to see if you had errors after the document is parsed, and you can tweak some of its parser settings to control what it will do or complain about:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<a>
  <b parm="4860 BOOMM 10x20 MD&">foobar</b>
</a>
EOT

puts doc.errors
puts doc.to_xml

Which will output:

xmlParseEntityRef: no name
<?xml version="1.0"?>
<a>
  <b parm="4860 BOOMM 10x20 MD">foobar</b>
</a>

Notice that Nokogiri stripped the & but I was still able to get usable output. You have to decide whether you want to halt with an error, using the STRICT option, or continue past the problem; Nokogiri can do either, depending on your needs.

You can massage the incoming XML:

require 'nokogiri'

xml = <<EOT
<a>
  <b parm="4860 BOOMM 10x20 MD&">foobar</b>
</a>
EOT

xml['MD&'] = 'MD&amp;'

doc = Nokogiri::XML(xml) do |config|
  config.strict
end

puts doc.errors
puts doc.to_xml

Which now outputs:

<?xml version="1.0"?>
<a>
  <b parm="4860 BOOMM 10x20 MD&amp;">foobar</b>
</a>

I know this isn't a perfect answer, but from my experience dealing with a lot of RSS/Atom and XML/HTML parsing, sometimes we have to open the dirty-tricks bag and go with whatever works instead of what is elegant.

Another path to nirvana in HTTParty would be to subclass the parser. You should be able to get inside the flow of XML to the parser and massage it there. From the docs:

# Intercept the parsing for all formats
class SimpleParser < HTTParty::Parser
  def parse
    perform_parsing
  end
end
Licensed under: CC-BY-SA with attribution