Question

I'm hoping to scrape together several tens of thousands of pages of government data (spread across several thousand folders) that are online and put it all into a single file. To speed up the process, I figured I'd download the site to my hard drive first, then crawl it with something like Anemone + Nokogiri. When I tried the sample code with the government site's online URL, everything worked fine, but when I change the URL to my local file path, the code runs but doesn't produce any output. Here's the code:

require 'anemone'

url = "file:///C:/2011/index.html"

Anemone.crawl(url) do |anemone|
  titles = []
  anemone.on_every_page { |page| titles.push page.doc.at('title').inner_html rescue nil }
  anemone.after_crawl { puts titles.compact }
end

So nothing gets output with the local file path, but it works successfully if I plug in the corresponding online URL. Is Anemone somehow unable to crawl local directory structures? If not, are there other suggested ways of doing this crawling/scraping, or should I simply run Anemone on the online version of the site? Thanks.

Solution

There are a couple of problems with this approach:

  1. Anemone expects a web address so it can issue HTTP requests, but you are passing it a file path. You can just load the file with Nokogiri instead and do the parsing through it (see the sketch after this list).

  2. The links in the files might be full URLs rather than relative paths; in that case you would still need to issue HTTP requests.
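
A minimal sketch of the first point, parsing a saved page directly with Nokogiri (the C:/2011/index.html path comes from the question; adjust it to wherever the pages were downloaded):

require 'nokogiri'

# Parse a downloaded page straight from disk -- no HTTP request involved.
doc = Nokogiri::HTML(File.read('C:/2011/index.html'))
puts doc.at('title').inner_html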

What you could do is download the files locally, then traverse them with Nokogiri, converting the links to local paths for Nokogiri to load next.
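
A rough sketch of that idea, assuming the whole site has been mirrored under one root folder and the pages link to each other with relative hrefs (the root path and traversal details here are assumptions, not part of the original answer):

require 'nokogiri'
require 'set'

root    = 'C:/2011'                        # hypothetical local mirror root
queue   = [File.join(root, 'index.html')]
visited = Set.new
titles  = []

until queue.empty?
  path = queue.shift
  next if visited.include?(path) || !File.file?(path)
  visited << path

  doc = Nokogiri::HTML(File.read(path))
  titles << (doc.at('title').inner_html rescue nil)

  # Queue relative links as local file paths; absolute URLs would still need HTTP.
  doc.css('a[href]').each do |link|
    href = link['href']
    next if href.start_with?('http', '#', 'mailto:')
    queue << File.expand_path(href, File.dirname(path))
  end
end

puts titles.compact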

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow