What's the best way to schedule and execute repetitive tasks (like scraping a page for information) in Rails?

StackOverflow https://stackoverflow.com/questions/20097287

Question

I'm looking for a solution which enables:

  1. Repeated execution of a scraping task (Nokogiri)
  2. Changing the time interval via http://www.myapp.com/interval (example)

What is the best solution/way to get this done?

Options I know about

  • Custom Rake task
  • Rufus Scheduler

Current situation

In ./config/initializers/task_scheduler.rb I have:

require 'nokogiri'
require 'open-uri'
require 'rufus-scheduler'

scheduler = Rufus::Scheduler.new

scheduler.every "1h" do
    puts "BEGIN SCHEDULER at #{Time.now}"

    url = "http://www.marktplaats.nl/z/computers-en-software/apple-ipad/ipad-mini.html?query=ipad+mini&categoryId=2722&priceFrom=100%2C00&priceTo=&startDateFrom=always"
    # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3+
    doc = Nokogiri::HTML(URI.open(url))

    # Listings are grouped under the CSS classes group-0 and group-1
    2.times do |number|
        doc.css(".defaultSnippet.group-#{number}").each do |listing|
            # at_css returns nil when a selector does not match, so .text would raise
            title = listing.at_css(".mp-listing-title").text
            subtitle = listing.at_css(".mp-listing-description").text
            price = listing.at_css(".price").text

            Listing.create(title: title, subtitle: subtitle, price: price)
        end
    end

    puts "END SCHEDULER at #{Time.now}"
end

Is it not working?

Yes the current setup is working. However, I don't know how to enable changing the interval time via http://www.myapp.com/interval (example).

Changing scheduler.every "1h" do to scheduler.every "#{@interval}" do does not work.

In what file do I have to define @interval for it to work in task_scheduler.rb?


Solution

First off: your rufus-scheduler code is in an initializer, which is fine, but an initializer runs only once, while the Rails process is booting. So in the initializer you have no access to an @interval variable that you set later, for instance in a controller.

What are the possible options, instead of an instance variable?

  • read it from a config file
  • read it from a database (but you may have to set up your own connection; as far as I know, ActiveRecord is not yet started in the initializer)

And if you change the value, you will have to restart your Rails process for it to take effect.
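
The config-file option could look roughly like the sketch below. The helper name read_interval, the file config/scraper.yml, and its interval key are assumptions made for illustration; none of them come from the question. The file is read once, when the initializer runs, which is why a restart is needed after editing it.

```ruby
require 'yaml'

# Hypothetical helper: reads an interval string such as "30m" from a YAML file.
# Falls back to the default when the file, or the "interval" key, is missing.
def read_interval(path, default = '1h')
  return default unless File.exist?(path)

  config = YAML.safe_load(File.read(path))
  config = {} unless config.is_a?(Hash) # empty or malformed file

  config.fetch('interval', default).to_s
end

# In the initializer this might be used as (the path is an assumption):
#   scheduler.every read_interval(Rails.root.join('config', 'scraper.yml')) do
#     ...
#   end
```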

An alternative approach, where your Rails process itself controls the interval of the scheduled job, is to use a recurring background job. At the end of its run, the background job reschedules itself using the interval that is active at that moment. I would propose fetching the interval from the database. Any background job handler could do this. Check Ruby Toolbox; I vote for resque or delayed_job.
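
The self-rescheduling pattern can be sketched in plain Ruby as follows. ScrapeJob, the queue object, and the interval_lookup callable are hypothetical names; the queue stands in for whichever job backend is used, and only needs an enqueue(job, run_at:) method here. This shows the pattern only, not the API of resque or delayed_job.

```ruby
# Hypothetical job object: does the work, then enqueues itself again using
# whatever interval is active at that moment.
class ScrapeJob
  # queue:           anything responding to enqueue(job, run_at:)
  # interval_lookup: callable returning the current interval in seconds,
  #                  e.g. a lookup of a settings row in the database
  # clock:           callable returning the current time (injectable for tests)
  def initialize(queue:, interval_lookup:, clock: -> { Time.now })
    @queue = queue
    @interval_lookup = interval_lookup
    @clock = clock
  end

  def perform
    scrape
    # The interval is read here, at the end of the run, so a value changed
    # through the web interface applies to the next run without a restart.
    @queue.enqueue(self, run_at: @clock.call + @interval_lookup.call)
  end

  private

  def scrape
    # The Nokogiri scraping code from the initializer would go here.
  end
end
```

A backend that serializes job objects usually cannot store lambdas, so a real job would more likely look up the interval directly inside perform.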

Other tips

I'm not very familiar with Rufus Scheduler, but it appears that it will be difficult to achieve both of your goals (regular heartbeat, dynamically rescheduled) with it. In order for it to work, you'll have to capture the job_id that it returns, use that job_id to stop the job if a rescheduling event occurs, and then create the new job. Rufus also points out that it's an in-memory application whose jobs will disappear when the process disappears -- reboot the server, restart the application, etc. and you've got to reschedule from scratch.
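
The capture, stop, and recreate steps might be wrapped up as in this sketch. ReschedulableTask is a hypothetical class; the scheduler passed in is assumed to behave like a Rufus::Scheduler instance, of which only every (returning a job id) and unschedule(job_id) are used.

```ruby
# Hypothetical wrapper: remembers the id of the current job so the job can be
# unscheduled and recreated whenever the interval changes.
class ReschedulableTask
  # scheduler: in the app this would be the Rufus::Scheduler instance
  # work:      the block to run on every tick (the scraping code)
  def initialize(scheduler, &work)
    @scheduler = scheduler
    @work = work
    @job_id = nil
  end

  # Stops the current job (if any) and starts a new one with the given interval.
  def reschedule(interval)
    @scheduler.unschedule(@job_id) if @job_id
    @job_id = @scheduler.every(interval, &@work)
  end
end

# Possible usage (names are assumptions):
#   SCRAPER = ReschedulableTask.new(Rufus::Scheduler.new) { run_scraper }
#   SCRAPER.reschedule('1h')    # in the initializer
#   SCRAPER.reschedule('30m')   # later, from a controller action
```

The new interval lives only in the memory of the process that received the request, so it is lost on restart and is not shared between several server processes.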

I'd consider two things. First, I'd consider creating a model that wraps the screen-scraping that you want to do. At a minimum you'd capture the url and the interval. The model may wrap up the code for processing the html response (basically what's wrapped up in the 2.times block) as instance methods that you trigger based on the URL. You may also capture this in a text column and use eval on it, assuming that only "good guys" get access to this part of the system. This has a couple of advantages: you can quickly expand to scraping other sites and you can sanitize the interval sent back by the user.
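
Sanitizing the interval sent back by the user could be as simple as the sketch below. The accepted format ("90s", "30m", "2h", "1d") and the lower and upper bounds are assumptions chosen for illustration; in a model this logic would probably sit in a validation.

```ruby
# Hypothetical validator for a user-supplied interval string.
# Returns the interval in seconds, or nil when the input is not acceptable.
UNIT_SECONDS = { 's' => 1, 'm' => 60, 'h' => 3600, 'd' => 86_400 }.freeze
MIN_INTERVAL = 60          # assumed lower bound: one minute
MAX_INTERVAL = 7 * 86_400  # assumed upper bound: one week

def sanitize_interval(raw)
  # Only digits followed by a single known unit are accepted; anything else
  # (empty input, extra characters, unknown units) is rejected.
  match = /\A(\d{1,6})([smhd])\z/.match(raw.to_s.strip)
  return nil unless match

  seconds = match[1].to_i * UNIT_SECONDS.fetch(match[2])
  seconds.between?(MIN_INTERVAL, MAX_INTERVAL) ? seconds : nil
end
```

Returning seconds rather than the raw string means the rest of the code never sees unvalidated user input.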

Second, something like Delayed::Job may better suit your needs. Delayed::Job allows you to specify a time for the job's execution which you could fill in by reading the model and converting the interval to a time. The key to this approach is that the job must schedule the next iteration of itself before it exits.

This won't be as rock-steady as something like cron but it does seem to better address the rescheduling need.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow