Question

Due to some limitations I want to switch my current project from EventMachine/EM-Synchrony to Celluloid but I've some trouble to get in touch with it. The project I am coding on is a web harvester which should crawl tons of pages as fast as possible.

For the basic understanding of Celluloid I've generated 10.000 dummy pages on a local web server and wanna crawl them by this simple Celluloid snippet:

#!/usr/bin/env jruby --1.9

require 'celluloid'
require 'open-uri'

IDS = 1..9999
BASE_URL = "http://192.168.0.20/files"

class Crawler
  include Celluloid
  def read(id)
    url = "#{BASE_URL}/#{id}"
    puts "URL: " + url
    open(url) { |x| x.read }
  end
end

pool = Crawler.pool(size: 100)

IDS.to_a.map do |id|
   pool.future(:read, id)
end

As far as I understand Celluloid, futures are the way to go to get the response of a fired request (comparable to callbacks in EventMachine), right? The other thing is, every actor runs in its own thread, so I need some kind of batching the requests cause 10.000 threads would result in errors on my OSX dev machine.

So creating a pool is the way to go, right? BUT: the code above iterates over the 9999 URLs but only 1300 HTTP requests are sent to the web server. So something goes wrong with limiting the requests and iterating over all URLs.

Was it helpful?

Solution

Likely your program is exiting as soon as all of your futures are created. With Celluloid a future will start execution but you can't be assured of it finishing until you call #value on the future object. This holds true for futures in pools as well. Probably what you need to do is change it to something like this:

  crawlers = IDS.to_a.map do |id|
    begin
      pool.future(:read, id)
    rescue DeadActorError, MailboxError
    end
  end

  crawlers.compact.each { |crawler| crawler.value rescue nil }
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top