Question

I am using HTTParty to access and parse certain web pages.

Using my usual approach:

require 'httparty'
require 'nokogiri'

response = HTTParty.get(url)
# Parse the raw response body as HTML
doc = Nokogiri::HTML(response.body)
doc.css('ul').each do |link|
  p link
end

All goes well until I reach a page that uses a class named "block". See the HTML tree below:

<li class="river-block">
  <div class="block block-thumb">
    <div class="block-content">
      Some content that I want
    </div>
  </div>
</li>

So for example:

doc.css('ul li').each do |link|
  p link
end

or

doc.css('ul li.river-block').each do |link|
  p link
end

returns nothing.

What is this class "block" or "block-content"? Is it being used to block this type of access, or am I just going about it the wrong way? If so, is there another way to read the content?


Solution

I have worked out the answer. The page's content only appears a few milliseconds after the initial load, which means that the HTML Nokogiri receives is more or less empty. So now I am searching for a way to wait until the content is there before Nokogiri parses the page. It turns out that the "block" class is simply part of a Drupal block theme. Nothing mysterious at all!
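
For anyone hitting the same symptom, one way to confirm this diagnosis is to look at the raw response before blaming the selector: Nokogiri only parses the string it is handed, so if the markup is not in that string, no CSS query will find it. Below is a minimal sketch of such a check. The helper `markup_present?` is hypothetical (it is not part of HTTParty or Nokogiri), and both sample HTML strings are made up for illustration, loosely following the tree in the question.

```ruby
# Hypothetical helper (not an HTTParty or Nokogiri method): returns true if
# the raw HTML string has a class attribute containing class_name as a
# whole class token.
def markup_present?(html, class_name)
  token = Regexp.escape(class_name)
  # The lookarounds stop "block" from matching inside "river-block".
  pattern = /class\s*=\s*["'][^"']*(?<![\w-])#{token}(?![\w-])[^"']*["']/
  html.match?(pattern)
end

# Assumed example of what the server might send before the list is filled in.
initial_html = '<html><body><ul id="river"></ul></body></html>'

# Assumed example of the markup as seen later in the browser's inspector.
rendered_html = '<ul><li class="river-block">' \
                '<div class="block block-thumb"></div></li></ul>'

puts markup_present?(initial_html, 'river-block')   # false
puts markup_present?(rendered_html, 'river-block')  # true
```

In real use the string to check would be `response.body` from the HTTParty call. If the check comes back false there while the browser's inspector shows the element, the content is most likely being added after the initial response, and the selector itself is fine.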

Licensed under: CC-BY-SA with attribution