Question

I'm trying to parse a webpage using open-uri + hpricot, but there seems to be a problem in the parsing process: the gems don't return what I want.

Specifically I want to get this div (whose id is 'pasajes') in this url:

http://www.despegar.com.ar

I wrote this code:

require 'nokogiri'
require 'hpricot'
require 'open-uri'

document = Hpricot(open('http://www.despegar.com.ar/')) # WITH HPRICOT
document2 = Nokogiri::HTML(open('http://www.despegar.com.ar/')) # WITH NOKOGIRI

pasajes = document.search("//div[@id='pasajes']")
pasajes2 = document2.xpath("//div[@id='pasajes']")

But it returns NOTHING! I've tried a lot of things in both hpricot and nokogiri:

  1. Giving the absolute path to that div
  2. Using a CSS path with selectors
  3. Using the hpricot search shortcut (doc/"div#pasajes")
  4. Almost every possible relative path to reach the 'pasajes' div

Finally I found a horrible solution: I used the watir library to open a web browser and then passed the resulting HTML to hpricot. That way hpricot DOES recognize the 'pasajes' div. But I don't want to open a web browser just for parsing purposes...

What am I doing wrong? Is open-uri misbehaving? Is hpricot?


Solution

There's no DIV with the id pasajes in the static HTML page. If you are running *nix you can see that by doing:

curl http://www.despegar.com.ar/ | grep pasajes

My guess is that it's JavaScript-generated.
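
The same check can be done from Ruby. This is only a minimal sketch, assuming Ruby 2.5 or later (for URI.open); the helper names static_id? and fetch_static_html are made up for illustration, and the regex is a crude test of the raw markup, not a real HTML parse:

```ruby
require 'open-uri'

# Returns true if the raw markup contains an id attribute with the given value.
# Crude on purpose: it only tells us whether the server sent the element at all.
def static_id?(html, id)
  html.match?(/(?<![\w-])id\s*=\s*["']?#{Regexp.escape(id)}(?=["'\s>])/i)
end

# Hypothetical helper: fetches the page exactly as the server delivers it.
# No JavaScript is executed, which is also what Hpricot and Nokogiri see.
def fetch_static_html(url)
  URI.open(url, &:read)
end

# Usage (needs network access):
#   html = fetch_static_html('http://www.despegar.com.ar/')
#   puts static_id?(html, 'pasajes') ? 'present in static HTML' : 'not in static HTML'
```

If this reports that the id is not in the static HTML, no XPath or CSS selector will find it, because the parser never receives that element.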

If you are using MacRuby you could try Lyndon.

OTHER TIPS

There's no div with id 'pasajes' in that page. That's the problem.

This fits more as an additional comment on Jonas' answer above rather than an answer in itself... But I am new to SO and do not have the "commenting powers" yet :)

You can use Selenium RC to download the full HTML and then use nokogiri on the downloaded file. Note that this will work only if the content is being generated/modified by JavaScript. If the webpage depends on cookies to set up the content, your options would be Selenium (in the browser) or watir, as you have noted.
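
A rough sketch of that browser-then-parse idea, assuming the selenium-webdriver gem (the successor to Selenium RC) rather than Selenium RC itself. The function rendered_html and its parameters are hypothetical names; the driver is passed in, so the sketch does not depend on a particular browser:

```ruby
# Sketch only. `driver` is assumed to be a Selenium::WebDriver instance, e.g.
#   require 'selenium-webdriver'
#   driver = Selenium::WebDriver.for(:firefox)
def rendered_html(driver, url, wait_for_id:, timeout: 10)
  driver.get(url)
  deadline = Time.now + timeout
  # Poll until the JavaScript-generated element shows up, or give up.
  until driver.find_elements(id: wait_for_id).any?
    if Time.now > deadline
      raise "element with id #{wait_for_id} did not appear within #{timeout}s"
    end
    sleep 0.25
  end
  # The DOM serialized after scripts have run, not the original server response.
  driver.page_source
end

# Usage (needs a browser, the selenium-webdriver gem and the nokogiri gem):
#   html = rendered_html(driver, 'http://www.despegar.com.ar/', wait_for_id: 'pasajes')
#   doc  = Nokogiri::HTML(html)
#   pasajes = doc.xpath("//div[@id='pasajes']")
#   driver.quit
```

This still opens a browser, so it does not avoid the cost the question complains about; it only moves the parsing back to nokogiri.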

I would love to hear a better solution to this (want to parse webpage with nokogiri, but the page is modified by JS).

I ran into a similar issue with Nokogiri, but on OS X 10.5. However, I first tried open-uri to open the pages in question, which have lots of HTML (div, p, whatever). I found that by using:

urldoc = open('http://hivelogic.com/articles/using_usr_local')
urldoc.each_line { |line| puts line } # readlines ignores a block; each_line yields every line

I would see lots of wonderful HTML. I also found that reading the "file" into a string and passing that to Nokogiri works fine. I even had to modify the very demo they use on rubyforge to teach you about Nokogiri.

Using their own example I get this:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))
=> <!DOCTYPE html>

>> doc.children
=> 

YUCK!

If I tweak it to read the URL into a string first, I get good stuff:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove').read)
=> <!DOCTYPE html>
<html>
<head>
..... TONS OF HTML HERE ........
</div>
</body>
</html>
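
The read-into-a-string workaround can be wrapped up roughly like this. It is only a sketch: html_string is a hypothetical helper name, and URI.open assumes Ruby 2.5 or later. Having the body as a String also lets you check that something actually came back before handing it to the parser:

```ruby
require 'open-uri'
require 'stringio'

# Reads the whole body into a String so it can be inspected before parsing.
# `io` is anything that responds to #read (an open-uri handle, a File, a StringIO).
def html_string(io)
  html = io.read
  raise ArgumentError, 'empty response body' if html.nil? || html.strip.empty?
  html
end

# Usage (needs network access and the nokogiri gem):
#   require 'nokogiri'
#   html = URI.open('http://www.google.com/search?q=tenderlove') { |io| html_string(io) }
#   doc  = Nokogiri::HTML(html)
#   puts doc.css('div').length
```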

Note I do see this lovely warning when I use irb to play:

HI. You're using libxml2 version 2.6.16 which is over 4 years old and has plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you upgrade your version of libxml2 and re-install nokogiri. If you like using libxml2 version 2.6.16, but don't like this warning, please define the constant I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring nokogiri.

But I am not in the mood to deal with the horrors and the various expert but contradictory advice on fixing libxml in /usr/local, blah blah. A post on link text has a great explanation of it, but then another *nix wizard attacks the very concept with some sound warnings and concerns. So I say, "no way".

Why do I write this? Because I think there might be a link between my Nokogiri blues and the libxml warning. OS X 10.5 ships old libraries, and Nokogiri may have issues with them.

QUESTION

Do any other OS X 10.5 users have this issue with Nokogiri?

Licensed under: CC-BY-SA with attribution