Question

I have a document and want to extract a couple of elements that are direct descendants of the parent element while leaving out others. The problem is that I don't get the elements in the order they appear in the document. The reason might be that the CSS selector I am using is wrong.

require 'nokogiri'

html = <<END
  <content>
    <p>Lorem</p>
    <div>
      FOO
      <p>BAR</p>
    </div>
    <h1>Ipsum</h1>
    <p>Dolor</p>
    <div>
      BAR
      <h2>FOO</h2>
    </div>
    <h2>Sit</h2>
    <p>Amet</p>
  </content>
END

doc = Nokogiri::HTML(html)
# Result I get (grouped by selector, not in document order):
# "<p>Lorem</p><p>Dolor</p><p>Amet</p><h1>Ipsum</h1><h2>Sit</h2>"
doc.css('content > p, content > h1, content > h2').to_html

What I want is

<p>Lorem</p><h1>Ipsum</h1><p>Dolor</p><h2>Sit</h2><p>Amet</p>

Solution

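Try this XPath instead. It selects the same direct children in a single expression, and Nokogiri returns the nodes of an XPath union (|) in document order: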

//content/p|//content/h1|//content/h2

OTHER TIPS

You want the elements listed in the order they appear in the document, but as your output shows, they come back grouped in the order of the comma-separated CSS selectors.

To solve this, you could add a common class attribute to the elements you want and select them with a single CSS selector on that class. With only one selector, the elements come back in document order. This only works if you control the markup.

Licensed under: CC-BY-SA with attribution