Question

I'm providing users with the ability to download an extremely large amount of data via CSV. To do this, I'm using Sidekiq and pushing the task off into a background job once they've initiated it. In the background job I generate a CSV containing all of the proper data, store it in /tmp, and then call save! on my model, passing the location of the file to the Paperclip attribute, which then uploads it to S3.

All of this works perfectly fine locally. My problem now lies with Heroku and its filesystem, which only keeps files around for a short duration and depends on which dyno you're on. My background job is unable to find the tmp file that gets saved because of how Heroku deals with these files. I guess I'm searching for a better way to do this. If there's some way that everything can be done in memory, that would be awesome. The only problem is that Paperclip expects an actual file object as an attribute when you're saving the model. Here's what my background job looks like:

class CsvWorker
  include Sidekiq::Worker

  def perform(report_id)
    puts "Starting the jobz!"
    report = Report.find(report_id)
    items = query_ranged_downloads(report.start_date, report.end_date)

    csv = compile_csv(items)

    update_report(report.id, csv)
  end

  def update_report(report_id, csv)
    report = Report.find(report_id)
    report.update_attributes(csv: csv, status: true)
    report.save!
  end

  def compile_csv(items)
    clean_items = items.compact
    path = File.new("#{Rails.root}/tmp/uploads/downloads_by_title_#{Process.pid}.csv", "w")
    csv_string = CSV.open(path, "w") do |csv|
      csv << ["Item Name", "Parent", "Download Count"]
      clean_items.each do |row|
        if !row.item.nil? && !row.item.parent.nil?
          csv << [
            row.item.name,
            row.item.parent.name,
            row.download_count
          ]
        end
      end
    end

    return path
  end
end

I've omitted the query method for readability's sake.


Solution

I don't think Heroku's temporary file storage is the problem here. The warnings around it mostly center on two facts: a) dynos are ephemeral, so anything you write can and will disappear without notice; and b) dynos are interchangeable, so the presence of inter-request tempfiles is a matter of luck when you have more than one web dyno running. However, in no situation do temporary files just vanish while your worker is running.

One thing I notice is that you're actually opening the same temporary file twice, leaving you with two separate file handles on one path:

> path = File.new("/tmp/filename", "w")
 => #<File:/tmp/filename> 
> path.fileno
 => 3 
> CSV.open(path, "w") do |csv| csv << %w(foo bar baz); puts csv.fileno end
4
 => nil 

You could change the path = line to just set the filename (instead of opening it for writing), and then make update_report open the filename for reading. I haven't dug into what Paperclip does when you give it an empty, already-overwritten, opened-for-writing file handle, but changing that flow may well fix the issue.
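Here's a minimal sketch of that version, assuming the same Report model and Paperclip csv attachment from the question; note that update_attributes already persists the record, so the extra save! isn't needed:

def compile_csv(items)
  # Build the path as a plain string; CSV.open creates and writes the file itself.
  path = "#{Rails.root}/tmp/uploads/downloads_by_title_#{Process.pid}.csv"

  CSV.open(path, "w") do |csv|
    csv << ["Item Name", "Parent", "Download Count"]
    items.compact.each do |row|
      next if row.item.nil? || row.item.parent.nil?
      csv << [row.item.name, row.item.parent.name, row.download_count]
    end
  end

  path
end

def update_report(report_id, csv_path)
  report = Report.find(report_id)
  # Open the finished file read-only and hand the handle to Paperclip.
  File.open(csv_path, "r") do |file|
    report.update_attributes(csv: file, status: true)
  end
end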

Alternately, you could do this in memory instead: generate the CSV as a string and give it to Paperclip as a StringIO. (Paperclip supports certain non-file objects, including StringIOs, using e.g. Paperclip::StringioAdapter.) Try something like:

# returns a CSV as a string
def compile_csv(items)
  CSV.generate do |csv|
     # ...
  end
end

def update_report(report_id, csv)
  report = Report.find(report_id)
  report.update_attributes(csv: StringIO.new(csv), status: true)
  report.save!
end
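Depending on your Paperclip version, a bare StringIO may end up stored under a generic default filename, so you may want to give it a filename and content type explicitly. One way to do that, as a sketch (the report.csv name here is just an example):

def update_report(report_id, csv)
  report = Report.find(report_id)

  file = StringIO.new(csv)
  # Paperclip picks these up when the object responds to them; without
  # them the attachment may fall back to a generic filename.
  def file.original_filename; "report.csv"; end
  def file.content_type; "text/csv"; end

  # update_attributes already saves, so no separate save! is needed.
  report.update_attributes(csv: file, status: true)
end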
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow