Question

The site I want to index is fairly big, 1.x million pages. I really just want a json file of all the URLs so I can run some operations on them (sorting, grouping, etc).

The basic Anemone loop worked well:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

But (because of the site size?) the terminal froze after a while, so I installed MongoDB and used the following:

require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'


$stdout = File.new('sitemap.json','w')


Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    puts page.url
  end
end

It's running now, but I'll be very surprised if there's output in the JSON file when I get back in the morning. I've never used MongoDB before, and the part of the Anemone docs about using storage wasn't clear (to me at least). Can anyone who's done this before give me some tips?


Solution

If anyone out there needs 100,000 URLs or fewer, the Ruby gem Spidr is a great way to go.
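
For what it's worth, a minimal sketch of that approach might look like the following. It assumes the spidr gem is installed and relies on the `Spidr.site` and `every_url` calls shown in Spidr's README. The `JsonArrayWriter` class, the `crawl_to_json` method and the default file name are made up for illustration. URLs are streamed to disk as they arrive, so a long crawl does not have to hold the whole list in memory, and the file fills up while the crawl is still running.

```ruby
require 'json'

# Hypothetical helper (not part of Spidr): streams URLs into a JSON array
# one element at a time, so the full list is never held in memory.
class JsonArrayWriter
  attr_reader :count

  def initialize(io)
    @io = io
    @count = 0
    @io.write("[\n")
  end

  # Append one URL (String or URI) as a JSON string.
  def <<(url)
    @io.write(",\n") if @count > 0
    @io.write(url.to_s.to_json)
    @io.flush # make progress visible while the crawl is still running
    @count += 1
    self
  end

  # Close the array; the output is only valid JSON after this call.
  def finish
    @io.write("\n]\n")
    @io.flush
  end
end

# Crawl one site with Spidr and stream every visited URL to `path`.
# The gem is required lazily so the writer above can be used without it.
# Returns the number of URLs written.
def crawl_to_json(start_url, path = 'sitemap.json')
  require 'spidr'

  File.open(path, 'w') do |file|
    writer = JsonArrayWriter.new(file)
    begin
      Spidr.site(start_url) do |spider|
        spider.every_url { |url| writer << url }
      end
    ensure
      writer.finish # close the array even if the crawl is interrupted
    end
    writer.count
  end
end
```

Calling `crawl_to_json('http://www.example.com/')` would then leave a file that `JSON.parse(File.read('sitemap.json'))` turns back into an array of strings, ready for sorting and grouping.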

OTHER TIPS

This is probably not the answer you wanted to see, but I strongly advise against using Anemone, and perhaps Ruby for that matter, to crawl a million pages.

Anemone is no longer maintained and fails on many edge cases.

Ruby is not the fastest language, and the standard interpreter (MRI) uses a global interpreter lock, which means threads cannot run Ruby code in parallel. I think your crawl will probably be too slow. For more information about threading, see the following links.

http://ablogaboutcode.com/2012/02/06/the-ruby-global-interpreter-lock/

Does ruby have real multithreading?
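
To make the point about the lock concrete, here is a small sketch that is independent of any crawler. On standard Ruby (MRI), only one thread executes Ruby code at a time, but a thread that is blocked waiting gives up the lock. The `fake_fetch` and `fetch_concurrently` methods, the URLs and the delay are invented for illustration, with `sleep` standing in for a network request.

```ruby
# Simulates a blocking network request. `sleep` releases MRI's global
# lock while waiting. (Hypothetical stand-in for a real HTTP call.)
def fake_fetch(url, delay)
  sleep(delay)
  "fetched #{url}"
end

# Runs one thread per URL and returns [results, elapsed_seconds].
def fetch_concurrently(urls, delay)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  threads = urls.map { |url| Thread.new { fake_fetch(url, delay) } }
  results = threads.map(&:value) # value joins the thread and returns its result
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  [results, elapsed]
end

urls = %w[
  http://www.example.com/a
  http://www.example.com/b
  http://www.example.com/c
  http://www.example.com/d
]
results, elapsed = fetch_concurrently(urls, 0.2)
puts results
puts format('4 simulated fetches of 0.2s each took %.2fs in total', elapsed)
```

The waits overlap, so the total should stay well under the 0.8s a sequential loop would need. Where the lock hurts is CPU-bound Ruby code such as parsing each page, which does not run in parallel on MRI no matter how many threads are started.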

You can try using Anemone with Rubinius or JRuby, which are much faster, but I'm not sure about the extent of compatibility.

I had some mild success going from Anemone to Nutch but your mileage may vary.

Licensed under: CC-BY-SA with attribution