Question

The site I want to index is fairly big, 1.x million pages. I really just want a json file of all the URLs so I can run some operations on them (sorting, grouping, etc).

The basic Anemone loop worked well:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

But (because of the site size?) the terminal froze after a while. So I installed MongoDB and used the following:

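require 'rubygems'
require 'anemone'
require 'mongo'
require 'json'

# Redirect everything written with puts into the output file.
# Note: this writes one plain URL per line, not actual JSON.
$stdout = File.new('sitemap.json', 'w')

Anemone.crawl("http://www.mybigexamplesite.com/") do |anemone|
  # Keep crawled pages in MongoDB instead of in memory.
  anemone.storage = Anemone::Storage.MongoDB
  anemone.on_every_page do |page|
    puts page.url
  end
end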

It's running now, but I'll be very surprised if there's output in the JSON file when I get back in the morning. I've never used MongoDB before, and the part of the Anemone docs about using storage wasn't clear (to me at least). Can anyone who's done this before give me some tips?


Solution

If anyone out there needs <= 100,000 URLs, the Ruby Gem Spidr is a great way to go.
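Whichever crawler produces the URLs, the file-writing part of the question can be kept separate from the crawl. The following is a minimal sketch, not something taken from Anemone or Spidr: the helper names `write_urls_as_json_lines` and `group_urls_by_host` are made up for illustration, and the "one JSON object per line" layout is an assumption chosen so that a crawl of any size never has to hold all URLs in memory.

```ruby
require 'json'
require 'uri'

# Hypothetical helper (not part of any crawler gem): writes each URL as one
# small JSON object per line, so URLs are streamed to disk as they arrive.
# Returns the number of URLs written.
def write_urls_as_json_lines(urls, path)
  count = 0
  File.open(path, 'w') do |file|
    urls.each do |url|
      file.puts(JSON.generate('url' => url.to_s))
      count += 1
    end
  end
  count
end

# Hypothetical helper: reads the file back, sorts the URLs and groups them
# by host, as an example of the "sorting, grouping" step from the question.
def group_urls_by_host(path)
  urls = File.foreach(path).map { |line| JSON.parse(line)['url'] }
  urls.sort.group_by { |url| URI(url).host }
end

# Inside a crawl, the same idea would be to open the file once and call
# file.puts(JSON.generate('url' => page.url.to_s)) from on_every_page,
# instead of reassigning $stdout.
```

Calling `group_urls_by_host('sitemap.jsonl')` afterwards returns a hash from host name to a sorted array of URLs. The file name is only an example.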

Other tips

This is probably not the answer you wanted to see, but I strongly advise against using Anemone, and perhaps Ruby for that matter, to crawl a million pages.

Anemone is not a maintained library and fails on many edge cases.

Ruby is not the fastest language, and its standard interpreter (MRI) uses a global interpreter lock, which means threads cannot execute Ruby code truly in parallel. I think your crawling will probably be too slow. For more information about threading, check out the following links.

http://ablogaboutcode.com/2012/02/06/the-ruby-global-interpreter-lock/

Does ruby have real multithreading?
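To make the threading point concrete, here is a small sketch of a fixed pool of Ruby threads draining a work queue. It is only an illustration: `process_with_threads` is a made-up helper, not part of Anemone. As far as I understand MRI, only one of these threads runs Ruby code at any moment because of the lock, although a thread that is blocked waiting on I/O lets the others proceed, so the pool helps with waiting on the network but not with CPU-bound work such as parsing.

```ruby
# Hypothetical helper: runs the given block over every item using a fixed
# number of worker threads and returns [item, result] pairs.
def process_with_threads(items, thread_count, &block)
  queue = Queue.new
  items.each { |item| queue << item }
  results = Queue.new

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        item = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when empty
        rescue ThreadError
          break           # no work left, so this worker stops
        end
        results << [item, block.call(item)]
      end
    end
  end

  workers.each(&:join)
  # Completion order depends on thread scheduling, so callers should not
  # rely on the order of the returned pairs.
  Array.new(results.size) { results.pop }
end
```

In a crawler the block would fetch a URL; here it can be any computation, for example `process_with_threads(urls, 4) { |url| url.length }`.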

You can try using Anemone with Rubinius or JRuby, which are much faster, but I'm not sure about the extent of compatibility.

I had some mild success going from Anemone to Nutch, but your mileage may vary.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow