Question

I'm hoping to scrape together several tens of thousands of pages of government data (spread across several thousand folders) that are online and put it all into a single file. To speed up the process, I figured I'd download the site to my hard drive first, then crawl it with something like Anemone + Nokogiri. When I tried the sample code with the government site's online URL, everything worked fine, but when I change the URL to my local file path, the code runs but doesn't produce any output. Here's the code:

require 'anemone'

url = "file:///C:/2011/index.html"

Anemone.crawl(url) do |anemone|
  titles = []
  anemone.on_every_page { |page| titles.push page.doc.at('title').inner_html rescue nil }
  anemone.after_crawl { puts titles.compact }
end

So nothing gets output with the local file path, but it works successfully if I plug in the corresponding online URL. Is Anemone somehow unable to crawl local directory structures? If not, are there other suggested ways of doing this crawling/scraping, or should I simply run Anemone on the online version of the site? Thanks.

Solution

There are a couple of problems with this approach:

  1. Anemone expects a web address so it can issue HTTP requests, but you are passing it a file path. You can just load the file with Nokogiri instead and do the parsing through it (see the sketch after this list).

  2. The links in the files might be full URLs rather than relative paths; in that case you would still need to issue HTTP requests.
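
A minimal sketch of the first point, parsing a saved page directly with Nokogiri (the C:/2011/index.html path comes from the question; adjust it to wherever the pages were downloaded):

require 'nokogiri'

# Parse a downloaded page straight from disk -- no HTTP request involved.
doc = Nokogiri::HTML(File.read('C:/2011/index.html'))
puts doc.at('title').inner_html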

What you could do is download the files locally, then traverse them with Nokogiri, converting the links to local paths for Nokogiri to load next.
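
A rough sketch of that idea, assuming the whole site has been mirrored under one root folder and the pages link to each other with relative hrefs (the root path and traversal details here are assumptions, not part of the original answer):

require 'nokogiri'
require 'set'

root    = 'C:/2011'                        # hypothetical local mirror root
queue   = [File.join(root, 'index.html')]
visited = Set.new
titles  = []

until queue.empty?
  path = queue.shift
  next if visited.include?(path) || !File.file?(path)
  visited << path

  doc = Nokogiri::HTML(File.read(path))
  titles << (doc.at('title').inner_html rescue nil)

  # Queue relative links as local file paths; absolute URLs would still need HTTP.
  doc.css('a[href]').each do |link|
    href = link['href']
    next if href.start_with?('http', '#', 'mailto:')
    queue << File.expand_path(href, File.dirname(path))
  end
end

puts titles.compact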

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow