I have worked out the answer. The page delays a few milliseconds before loading its content, which means that the HTML Nokogiri receives is more or less empty. So now I am searching for a way to wait for that content before Nokogiri parses the page. It turns out that the "block" class is simply part of a Drupal block theme. Nothing mysterious at all!
Can a mysterious HTML "block" class stop access to specific divs? (Rails / HTTParty)
30-07-2022
Question
I am using HTTParty to fetch certain web pages and Nokogiri to parse them.
Using my usual approach:
require 'httparty'
require 'nokogiri'

response = HTTParty.get(url)
doc = Nokogiri::HTML(response.body) # parse the raw response body
doc.css('ul').each do |link|
  p link
end
All goes well until I reach a page that uses a class "block". See the HTML tree below:
<li class="river-block">
  <div class="block block-thumb">
    <div class="block-content">
      Some content that I want
    </div>
  </div>
</li>
So for example:
doc.css('ul li').each do |link|
  p link
end
or
doc.css('ul li.river-block').each do |link|
  p link
end
returns nothing.
What is this class "block" or "block-content"? Is it being used to block this type of access, or am I just going about it the wrong way? And if so, is there any other way in to read the content?
Solution
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow