What is the best way to parse a web page in Ruby?
02-07-2019
Question
I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on Stack Overflow, how can I get the data into a usable format?
Say I want to parse my own user page for my current reputation score and badge listing. I tried to convert the source retrieved from my user page into XML, but the conversion failed due to a missing div. I know I could do a string comparison to find the text I'm looking for, but there has to be a much better way of doing this.
I want to incorporate this into a simple script that spits out my user data at the command line, and possibly expand it into a GUI application.
Solution
Use Nokogiri now.
OTHER TIPS
Unfortunately, Stack Overflow claims to be XML but actually isn't. Hpricot, however, can parse this tag soup into a tree of elements for you.
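require 'hpricot'
require 'open-uri'
# Fetch the page and let Hpricot build a tree from the tag soup.
# (On Ruby 3.0+, open-uri requires URI.open instead of Kernel#open.)
doc = Hpricot(open("http://stackoverflow.com/users/19990/armin-ronacher"))
# Select the reputation element, strip non-digits (e.g. commas), and convert to an integer.
reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i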
And so forth.
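The cleanup step at the end of that snippet (strip non-digits, then convert) can be tried on its own without any parsing library. A small sketch, where the helper name `reputation_from_text` is made up for illustration:

```ruby
# Strip everything except digits from a scraped string and convert to an integer.
# Returns nil when the text contains no digits at all.
def reputation_from_text(text)
  digits = text.gsub(/[^\d]+/, "")
  digits.empty? ? nil : digits.to_i
end

puts reputation_from_text("  12,345 \n")
puts reputation_from_text("reputation").inspect
```

One caveat of this approach: an abbreviated value such as "1.2k" would come out as 12, so it only works when the page shows the full number.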
Try Hpricot; it's, well... awesome.
I've used it several times for screen scraping.
I always really like what Ilya Grigorik writes, and he wrote up a nice post about using hpricot.
I also read this post a while back and it looks like it would be useful for you.
I haven't tried either myself, so YMMV, but these seem pretty useful.
Something I ran into when trying to do this before is that few web pages are well-formed XML documents. Hpricot may be able to deal with that (I haven't used it). When I did a similar project in the past, using Python and its library's built-in parsing functions, it helped to have a pre-processor to clean up the HTML. I used the Python bindings for HTML Tidy for this, and it made life a lot easier. Ruby bindings are here, but I haven't tried them.
Good luck!
This seems to be an old topic, but here is a newer example that gets the reputation:
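#!/usr/bin/env ruby
require 'rubygems'
require 'hpricot'
require 'open-uri'
user = "619673/100kg"
# Build the URL of the user's reputation tab.
url = "http://stackoverflow.com/users/%s?tab=reputation" % user
puts url
# (On Ruby 3.0+, open-uri requires URI.open instead of Kernel#open.)
doc = Hpricot(open(url))
# Take the first matching element; String#each no longer exists in Ruby 1.9+.
count = doc.search("div[@class='subheader user-full-tab-header']/h1/span[@class='count']").first
abort "reputation not found" if count.nil?
puts "reputation " + count.inner_text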