Question

I am working on an application that scrapes information from a number of different sites. To get the deep link for the desired topic on a site (e.g. "Forum"), I rely on the sitemap the site provides. As I expand, I have come across some sites that don't provide a sitemap, so I was wondering whether there is a way to generate one within Rails, starting from just the site's top-level URL?

I am using Nokogiri and Mechanize to retrieve data, so a solution built on functionality from either of them would be easier to integrate.

Solution

This can be done with the Spidr gem like so:

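require 'spidr'

# Maps each destination URL to the list of pages that link to it.
url_map = Hash.new { |hash, key| hash[key] = [] }

# Crawl every page on the site and record each link that is found.
Spidr.site('http://intranet.com/') do |spider|
  spider.every_link do |origin, dest|
    url_map[dest] << origin
  end
end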
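
Once url_map is filled, its keys are the discovered URLs, and they could be written out as a sitemap. The following is only a rough sketch: sitemap_xml is a hypothetical helper name (not part of Spidr), the sample URLs are made up, and it assumes a bare <urlset> of <loc> entries in the sitemaps.org format is enough for your purposes.

```ruby
# Hypothetical helper: turns the keys of url_map (the crawled destinations)
# into a minimal sitemaps.org-style XML document.
def sitemap_xml(url_map)
  entries = url_map.keys.map(&:to_s).uniq.sort.map do |url|
    # encode(xml: :text) escapes &, < and > so the URL is safe inside <loc>
    "  <url><loc>#{url.encode(xml: :text)}</loc></url>"
  end

  <<~XML
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    #{entries.join("\n")}
    </urlset>
  XML
end

# Made-up sample data in the same shape as url_map (destination => origins).
sample_map = {
  'http://intranet.com/forum?board=1&page=2' => ['http://intranet.com/'],
  'http://intranet.com/' => []
}

puts sitemap_xml(sample_map)
```

Keys are converted with to_s, so the helper should work whether the hash holds plain strings or the URI objects that the crawl yields. If you would rather build the XML with Nokogiri, which you already use, the same loop carries over.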
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow