Question

I'm trying to parse the Twitter usernames from a bit.ly stats page using Nokogiri:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://bitly.com/U026ue+/global'))

twitter_accounts = []

shares = doc.xpath('//*[@id="tweets"]/li')

shares.map do |tweet|
  twitter_accounts << tweet.at_css('.conv.tweet.a')
end

puts twitter_accounts

My understanding is that Nokogiri stores shares as some form of tree structure that I can drill down into, but my mileage is varying.


Solution

Actually, Eric Walker was onto something. If you look at doc, the section where the tweets are supposed to be looks like this:

<h2>Tweets</h2>
  <ul id="tweets"></ul>
</div>

This is likely because the tweets are generated by a JavaScript call, which Nokogiri doesn't execute. One possible solution is to use Watir to navigate to the page, let the JavaScript run, and then save the resulting HTML.

Here is a script that uses Watir to do just that. Note that your XPath expression had some issues, which I've fixed, and that Watir will open a new browser window every time you run this script:

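require 'watir'
require 'nokogiri'

# Drive a real browser so the page's JavaScript runs and fills in the tweet list.
browser = Watir::Browser.new
browser.goto 'http://bitly.com/U026ue+/global'

# Hand the rendered HTML (after the JavaScript has run) to Nokogiri.
doc = Nokogiri::HTML.parse(browser.html)
browser.close

# Select the links inside every <li> whose class contains "tweet",
# then collect each link's title attribute.
shares = doc.xpath('//li[contains(@class, "tweet")]/a')
twitter_accounts = shares.map { |link| link.attr('title') }

puts twitter_accounts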

You can also use the headless gem to prevent a browser window from opening.

Other tips

That data is coming in from an Ajax request with a JSON response. It's pretty easy to get at though:

require 'json'
require 'open-uri'

url = 'http://search.twitter.com/search.json?_usragnt=Bitly&include_entities=true&rpp=100&q=nowness.com%2Fday%2F2012%2F12%2F6%2F2643'
# URI.open is provided by open-uri (Ruby 2.5+); on older Rubies use plain open(url).
hash = JSON.parse(URI.open(url).read)
puts hash['results'].map { |x| x['from_user'] }

I got that URL by loading the page in Chrome and then looking at the network panel. I also removed the timestamp and callback parameters just to clean things up a bit.
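To try the extraction step without hitting the network, here is a minimal sketch that runs the same lookup against an inline sample. The payload is made up for illustration; only the `results` and `from_user` keys come from the snippet above. The helper name `usernames_from`, the de-duplication, and the skipping of entries without a user are assumptions added here, not part of the original answer:

```ruby
require 'json'

# Hypothetical sample payload. Only the "results" and "from_user" keys are
# taken from the answer; the values are invented for illustration.
SAMPLE_RESPONSE = <<~PAYLOAD
  {"results": [
    {"from_user": "alice", "text": "first tweet"},
    {"from_user": "bob",   "text": "second tweet"},
    {"from_user": "alice", "text": "third tweet"},
    {"text": "entry without a user"}
  ]}
PAYLOAD

# Hypothetical helper: returns the unique usernames found in a JSON response body.
def usernames_from(json_text)
  # Fall back to an empty list if the response has no "results" key.
  results = JSON.parse(json_text).fetch('results', [])
  # Drop entries without a "from_user" value and remove duplicates.
  results.map { |result| result['from_user'] }.compact.uniq
end

puts usernames_from(SAMPLE_RESPONSE)
```

Keeping the parsing in a small method like this makes it easy to swap the sample string for the body of a real response once the request itself works.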

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow