Question

I am trying to search for a specific node in an XML file using XPath. This search worked just fine under REXML but REXML was too slow for large XML docs. So moved over to LibXML.

My simple example is processing a Yum repomd.xml file, an example can be found here: http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml

My test script is as follows:

require 'rubygems'
require 'libxml'

p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse

filelist = repomd.find_first("/repomd/data[@type='filelists']/location@href")
puts "Length: " + filelist.length.to_s
filelist.each do |f|
   puts f.attributes['href']
end

I get this error:

Error: Invalid expression.
/usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find': Error: Invalid expression. (LibXML::XML::Error)
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:123:in `find'
from /usr/lib/ruby/gems/1.8/gems/libxml-ruby-2.7.0/lib/libxml/document.rb:130:in `find_first'
from /tmp/scripty.rb:6

I have also tried simpler examples like below, but still no dice.

p = LibXML::XML::Parser.file( "/tmp/dr.xml")
repomd = p.parse
filelist = repomd.root.find(".//location")
puts "Length: " + filelist.length.to_s

In the above case I get the output:

Length: 0

Your inspired guidance would be greatly appreciated, and I have searched for what I am doing wrong, and I just can't figure it out...

Here is some code that will fetch the file and process it, still doesn't work...

require 'rubygems'
require 'open-uri'
require 'libxml'

raw_xml = open('http://mirror.san.fastserv.com/pub/linux/centos/6/os/x86_64/repodata/repomd.xml').read
p = LibXML::XML::Parser.string(raw_xml)
repomd = p.parse
filelist = repomd.find_first("//data[@type='filelists']/location[@href]")
puts "First: " + filelist
Was it helpful?

Solution

In the end I reverted back to REXML and used stream processing. Much faster and much easier XPath syntax implementation.

OTHER TIPS

Looking at your code,it seems you want to collect only those location elements which has href attribute. If that's the case below should work:

"//data[@type='filelists']/location[@href]"
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top