Question

I am trying to use nokogiri to scrape a page for some div text.

The pattern in the HTML looks like this. The page has hundreds of divs that are formatted this way:

<div class="thing text-text" data-thing-id="29966403">
  <div class="thinguser"><i class="ico ico-water ico-blue"></i>
  <div class="status">in 7 days
</div>
</div>
<div class="ignore-ui pull-right"><input type="check box" >
</div>
<div class="col_a col text">
  <div class="text">foobar
  </div>
  </div>
<div class="col_b col text">
  <div class="text">foobar desc
  </div>
</div>
</div>

(sorry about the bad formatting)

I just want to grab the ID (data-thing-id) and the col_a text from each code block so that the output looks like:

29966403 foobar
29964234 barfoo

Here's the code I currently have that does not work:

#!/usr/bin/env ruby

require 'nokogiri'
require 'open-uri'

tids = Array.new
terms = Array.new

doc = Nokogiri::HTML(open("http://somewebsite.com/"))

tids = doc.xpath("//div[contains(@class,'thing')]/data-thing-id()").collect {|node| node.text.strip}
terms = doc.xpath("//div[contains(@class,'col_b')]/text()").collect {|node| node.text.strip}

tids.zip(terms).each do |tid.term|
puts tid+" "+term
end

Thanks in advance, Chris


Solution

Try:

tids =  doc.xpath("//div[contains(concat(' ', @class, ' '),' thing ')]").collect {|node| node['data-thing-id']}
terms = doc.xpath("//div[contains(concat(' ', @class, ' '),' col_b ')]").collect {|node| node.text.strip }

tids.zip(terms).each do |tid, term|
  puts tid+" "+term
end
#  => 29966403 foobar desc

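The code above runs two XPath queries against the doc: one finds every div whose class list includes thing, the other every div whose class list includes col_b. The concat(' ', @class, ' ') wrapper pads the class attribute with spaces so that ' thing ' matches only the whole class name; a plain contains(@class,'thing') would also match the inner thinguser divs. From each matched div the code then takes either the data-thing-id attribute (node['data-thing-id']) or the stripped text inside the element, and collects the results into two arrays that zip pairs up by position. The selector uses col_b because the question's code did; replace it with col_a to get the text shown in the desired output.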

Nokogiri supports both XPath and CSS selectors, and the documentation for each explains how to get the most out of them.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow