Question

Currently, Typhoeus doesn't have automatic re-download in case of failure. What would be the best way of ensuring a retry if the download is not successful?

def request
  request ||= Typhoeus::Request.new("www.example.com")
  request.on_complete do |response|
    if response.success?
      xml = Nokogiri::XML(response.body)
    else
      # retry to download it
    end
  end
end

Solution

I think you need to refactor your code. At a minimum, you should be working with two queues and two threads.

The first is a queue of URLs that you pull from to read via Typhoeus::Request.

If the first queue is empty, sleep that thread for a minute, then check again for a URL to retrieve. If you successfully read the page, parse it and push the resulting XML doc onto a second queue of DOMs to work on. Process that queue from a second thread. If the second queue is empty, sleep the second thread until there is something to work on.

If reading a URL fails, automatically re-push it onto the first queue.

If both queues are empty you could exit the code, or let both threads sleep until something says to start processing URLs again and you repopulate the first queue.
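The two-queue, two-thread layout described above could be sketched roughly as follows. This is only an illustration using Ruby's standard Thread and Queue classes: run_pipeline and flaky_fetcher are hypothetical names, and the fetcher lambda is a stand-in for the real Typhoeus call (it deliberately fails the first attempt for each URL so the re-push path is exercised). It also uses the blocking Queue#pop instead of an explicit sleep, and it has no retry limit yet.

```ruby
# Hypothetical stand-in for a Typhoeus request: returns nil (failure) on the
# first attempt for each URL and a fake XML body on later attempts.
def flaky_fetcher
  attempts = Hash.new(0)
  lambda do |url|
    attempts[url] += 1
    attempts[url] > 1 ? "<page url='#{url}'/>" : nil
  end
end

def run_pipeline(urls, fetch)
  url_queue = Queue.new   # first queue: URLs waiting to be read
  doc_queue = Queue.new   # second queue: documents waiting to be processed
  urls.each { |u| url_queue << u }
  results = []

  fetcher = Thread.new do
    remaining = urls.size
    while remaining > 0
      url = url_queue.pop            # blocks while the queue is empty
      body = fetch.call(url)
      if body
        doc_queue << body            # hand the document to the second thread
        remaining -= 1
      else
        url_queue << url             # failed read: re-push onto the first queue
      end
    end
    doc_queue << :done               # sentinel telling the worker to stop
  end

  worker = Thread.new do
    while (doc = doc_queue.pop) != :done
      results << doc                 # real code would parse/process the doc here
    end
  end

  [fetcher, worker].each(&:join)
  results
end

docs = run_pipeline(["http://a.example", "http://b.example"], flaky_fetcher)
```

In real code the lambda would wrap a Typhoeus::Request and return the response body only when response.success? is true.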

You also need a retry counter associated with each URL; otherwise, if a site goes down, you could retry forever. You could push small sub-arrays onto the queue, such as:

["url", 0]

where 0 is the retry count. For something more complex, use an object or define a class. Whatever you do, increment that counter until it hits a drop-dead value, then stop re-queuing that URL and either report it or remove it from your database of source URLs.
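As a rough sketch of the retry counter, the loop below pulls ["url", retries] pairs from a queue and stops re-queuing a URL once its counter reaches a limit. The name drain, the default limit of 3, and the fetch lambda are all assumptions made for illustration; fetch stands in for the Typhoeus request and returns nil on failure. It is single-threaded to keep the counter logic easy to follow.

```ruby
# Processes [url, retries] entries until the queue is empty.
# Returns the successfully fetched bodies and the URLs that were given up on.
def drain(queue, fetch, max_retries: 3)
  fetched = {}
  given_up = []
  until queue.empty?
    url, retries = queue.pop
    body = fetch.call(url)
    if body
      fetched[url] = body
    elsif retries < max_retries
      queue << [url, retries + 1]    # failed: re-queue with the counter bumped
    else
      given_up << url                # drop-dead value reached: report, don't re-queue
    end
  end
  [fetched, given_up]
end

queue = Queue.new
queue << ["http://up.example", 0] << ["http://down.example", 0]
# Hypothetical fetcher: one site always answers, the other never does.
fetch = ->(url) { url == "http://up.example" ? "<ok/>" : nil }
fetched, given_up = drain(queue, fetch)
```

With a limit of 3, a URL that never responds is attempted four times in total (the initial try plus three retries) before it lands in given_up.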

That's somewhat similar to code I've written a couple times to handle big spidering tasks.

See Ruby's Thread and Queue classes for examples of this.

Also:

request ||= Typhoeus::Request.new("www.example.com")

makes no sense. Inside the method, request is a fresh local variable that is nil when that line runs, so the ||= will always assign. Instead use:

request = Typhoeus::Request.new("www.example.com")

modified with the appropriate code to pull the next value from the first queue mentioned above.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow