Question

I'm trying to use open-uri to fetch the HTML of a web page. The problem is that the page needs a couple of seconds to finish loading before it contains the content I want. What I have right now is:

require 'open-uri'

# Kernel#open no longer accepts URLs as of Ruby 3.0; open-uri provides URI.open instead.
html = URI.open('http://hiddencode.me/dribbbucket/embed.html?key=MY_API_KEY&bucket=56024-Glassboard&delay=5000')
response = html.read
puts response

If I run this right now, I get:

<div id="slam-dunk">
    <div id="loading">Loading..</div>
</div>

However, the page needs to finish loading before I read it, otherwise I don't get the correct response. Any ideas how to do this in Ruby? A solution in another language is fine too, if Ruby is not your area of expertise!

Solution

As an example, I recently used watir-webdriver to accomplish a similar task. It drives a real browser, so you can query the DOM after the JavaScript has run and pull out anything you want. If you'd like it to run headless, I used the headless gem in my case.
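A minimal sketch of that approach, not the exact code I used. It assumes the current `watir` gem (which absorbed watir-webdriver), a local Chrome with its driver, and that the page removes or hides the `#loading` div from the question once the real content arrives. The method name `rendered_html` and the optional `browser:` argument are made up for illustration, and headless setup is left out.

```ruby
# Returns the page HTML after the "Loading.." placeholder has disappeared.
# `browser:` is a hypothetical hook for reusing an already open browser;
# when it is omitted, a Chrome window is started and closed again here.
def rendered_html(url, browser: nil)
  owns_browser = browser.nil?
  if owns_browser
    require 'watir' # loaded lazily; assumes the watir gem is installed
    browser = Watir::Browser.new(:chrome)
  end

  browser.goto(url)
  # Assumption: the page drops or hides <div id="loading"> when it is done.
  # wait_while polls until the block is false (Watir raises on timeout).
  browser.div(id: 'loading').wait_while(&:present?)
  browser.html
ensure
  browser.close if owns_browser && browser
end

# Usage (commented out because it launches a real browser):
# puts rendered_html('http://hiddencode.me/dribbbucket/embed.html?key=MY_API_KEY&bucket=56024-Glassboard&delay=5000')
```

Waiting on a specific element is usually more reliable than sleeping for a fixed number of seconds, but which element to wait on depends on the page.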

If you'd like to stick with open-uri, you'll have to use something like HttpFox to watch which AJAX requests the JavaScript makes (many other tools can do this as well). Start HttpFox before you visit the URL, wait until the information you're trying to scrape appears, then stop HttpFox and go through each request, checking each response for anything relevant to what you're scraping. Once you identify the right request, you may be able to make it directly with open-uri. While this is the simplest approach, it is not guaranteed to work, because web applications vary widely in how they interact with servers and manipulate the DOM.
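Once the request is identified, replaying it could look roughly like the sketch below. It assumes the request turns out to be a plain GET that returns JSON, which may not hold for this site. The method name `fetch_json` and the URL in the usage comment are placeholders, and the `X-Requested-With` header is only included because some endpoints check for it.

```ruby
require 'open-uri'
require 'json'

# Fetches `url` the way a browser's AJAX call might and parses the JSON body.
# Assumption: the endpoint answers a simple GET with JSON.
def fetch_json(url)
  body = URI.open(
    url,
    'Accept' => 'application/json',
    'X-Requested-With' => 'XMLHttpRequest', # optional; mimics an AJAX call
    &:read
  )
  JSON.parse(body)
end

# Usage (placeholder URL; substitute the request found in the network log):
# data = fetch_json('http://example.com/path/seen/in/network/log.json')
# puts data.inspect
```

If the request needs cookies, a POST body, or a token generated by the page's JavaScript, open-uri alone will probably not be enough and the browser-driven approach is the safer choice.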

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow