Question

I'm trying to parse the Twitter usernames from a bit.ly stats page using Nokogiri:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://bitly.com/U026ue+/global'))

twitter_accounts = []

shares = doc.xpath('//*[@id="tweets"]/li')

shares.map do |tweet|
  twitter_accounts << tweet.at_css('.conv.tweet.a')
end

puts twitter_accounts

My understanding is that Nokogiri stores shares as some form of tree structure that I can drill down into, but my mileage is varying.


Solution 2

Actually, Eric Walker was onto something. If you look at doc, the section where the tweets are supposed to be looks like this:

<h2>Tweets</h2>
  <ul id="tweets"></ul>
</div>

This is likely because the tweets are generated by a JavaScript call, which Nokogiri does not execute. One possible solution is to use Watir to navigate to the page, let the JavaScript run, and then pass the rendered HTML to Nokogiri.

Here is a script that does just that. Note that your original selectors had some issues, which I have fixed, and that Watir will open a new browser window every time you run this script:

require 'watir'
require 'nokogiri'

browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'

doc = Nokogiri::HTML.parse(browser.html)

twitter_accounts = []

shares = doc.xpath('//li[contains(@class, "tweet")]/a')

shares.each do |tweet|
  twitter_accounts << tweet.attr('title')
end

puts twitter_accounts
browser.close

You can also use headless to prevent a window from opening.

OTHER TIPS

That data is coming in from an Ajax request with a JSON response. It's pretty easy to get at though:

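require 'json'
require 'open-uri' # lets Ruby read a URL like a file

url = 'http://search.twitter.com/search.json?_usragnt=Bitly&include_entities=true&rpp=100&q=nowness.com%2Fday%2F2012%2F12%2F6%2F2643'
# URI.open is the open-uri entry point; on Ruby 3.0+ plain open no longer accepts URLs
hash = JSON.parse(URI.open(url).read)
puts hash['results'].map { |x| x['from_user'] }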

I got that URL by loading the page in Chrome and looking at the Network panel. I also removed the timestamp and callback parameters to clean things up a bit.
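
The extraction step can be tried offline against a canned response. This is only a sketch: SAMPLE_RESPONSE is a made-up payload that copies just the two keys the snippet above reads (results and from_user), and usernames_from is a hypothetical helper name, not part of any API.

```ruby
require 'json'

# Made-up payload with the same shape the snippet above relies on:
# a top-level "results" array whose entries carry a "from_user" key.
SAMPLE_RESPONSE = <<~JSON
  {"results": [
    {"from_user": "alice", "text": "first share"},
    {"from_user": "bob",   "text": "second share"},
    {"from_user": "alice", "text": "third share"}
  ]}
JSON

# Illustrative helper: pull the usernames out of a JSON response body,
# tolerating a missing "results" key and dropping nils and duplicates.
def usernames_from(json_text)
  data = JSON.parse(json_text)
  data.fetch('results', []).map { |tweet| tweet['from_user'] }.compact.uniq
end

puts usernames_from(SAMPLE_RESPONSE)
```

Using fetch with a default keeps the helper from raising when the response has no results key, which may be worth having if the endpoint returns an error object instead.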

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow