Question

Rails 4, Mongoid instead of ActiveRecord (but this shouldn't change anything for the sake of the question).

Let's say I have a MyModel domain class with some validation rules:

class MyModel
  include Mongoid::Document

  field :text, type: String
  field :type, type: String

  belongs_to :parent

  validates :text, presence: true
  validates :type, inclusion: %w(A B C)
  validates_uniqueness_of :text, scope: :parent # important validation rule for the purpose of the question
end

where Parent is another domain class:

class Parent
    include Mongoid::Document

    field :name, type: String

    has_many :my_models
end

I also have the related collections in the database populated with some valid data.

Now, I want to import some data from a CSV file, which can conflict with the existing data in the database. The easy thing to do is to create an instance of MyModel for every row in the CSV, verify whether it's valid, then save it to the database (or discard it).

Something like this:

csv_rows.each do |data| # simplified
  my_model = MyModel.new(data) # data is the hash with the values taken from the CSV row

  if my_model.valid?
    my_model.save validate: false
  else
    # do something useful, but not interesting for the question's purpose
    # just know that I need to separate validation from saving
  end
end

Now, this works pretty smoothly for a limited amount of data. But when the CSV contains hundreds of thousands of rows, this gets quite slow, because (worst case) there's a write operation for every row.

What I'd like to do, is to store the list of valid items and save them all at the end of the file parsing process. So, nothing complicated:

valids = []
csv_rows.each do |data|
  my_model = MyModel.new(data)

  if my_model.valid?  # THE INTERESTING LINE: this "if" checks only against the database; what happens if it conflicts with other my_models not yet saved?
    valids << my_model
  else
    # ...
  end
end

if valids.size > 0
  # bulk insert of all data
end

That would be perfect, if I could be sure that the data in the CSV does not contain duplicated rows or data that violates MyModel's validation rules.


My question is: how can I check each row against the database AND the valids array, without having to repeat the validation rules defined in MyModel (avoiding duplicating them)?

Is there a different (more efficient) approach I'm not considering?

Solution

What you can do is validate each model, save its attributes in a hash, push that hash onto the valids array, then do a bulk insert of the values using MongoDB's insert:

valids = []
csv_rows.each do |data|
  my_model = MyModel.new(data)

  if my_model.valid?
    valids << my_model.attributes
  end
end

MyModel.collection.insert(valids, continue_on_error: true)

This won't, however, prevent NEW duplicates within the CSV itself... for that you could do something like the following, using a hash and a compound key:

valids = {}
csv_rows.each do |data|
  my_model = MyModel.new(data)

  if my_model.valid?
    valids["#{my_model.text}_#{my_model.parent_id}"] = my_model.as_document # compound key: text + parent id
  end
end
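
To see why the compound key deduplicates within the CSV itself, here is a minimal illustration with plain hashes standing in for the model documents (no Mongoid needed; later duplicates simply overwrite earlier ones under the same key):

```ruby
rows = [
  { text: "foo", parent_id: 1 },
  { text: "foo", parent_id: 1 }, # duplicate within the CSV
  { text: "foo", parent_id: 2 }
]

valids = {}
rows.each do |doc|
  # same compound key => later duplicates overwrite earlier ones
  valids["#{doc[:text]}_#{doc[:parent_id]}"] = doc
end

valids.values.size # => 2
```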

Then either of the following will work, DB-agnostic:

MyModel.create(valids.values)

Or MongoDB'ish:

MyModel.collection.insert(valids.values, continue_on_error: true)

OR EVEN BETTER

Ensure you have a unique index on the collection:

class MyModel
  ...
  index({ text: 1, parent: 1 }, { unique: true, dropDups: true })
  ...
end
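
Note that declaring the index in the model does not create it in MongoDB by itself; with Mongoid you would normally create the declared indexes via its rake task (a sketch, assuming a standard Rails/Mongoid setup):

```shell
rake db:mongoid:create_indexes
```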

Then just do the following:

MyModel.collection.insert(csv_rows, continue_on_error: true)

http://api.mongodb.org/ruby/current/Mongo/Collection.html#insert-instance_method
http://mongoid.org/en/mongoid/docs/indexing.html

TIP: if you anticipate thousands of rows, I recommend doing this in batches of 500 or so.
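
The batching tip can be sketched like this (a minimal sketch: insert_in_batches is a hypothetical helper, and the commented-out line marks where the actual MongoDB insert call from above would go):

```ruby
BATCH_SIZE = 500

# Hypothetical helper: slices the documents into groups of batch_size
# and yields each group, so every batch becomes one insert call
# instead of one giant insert for the whole file.
def insert_in_batches(docs, batch_size = BATCH_SIZE)
  docs.each_slice(batch_size) do |batch|
    # MyModel.collection.insert(batch, continue_on_error: true)
    yield batch
  end
end

# e.g. 1200 documents => three inserts of 500, 500 and 200 documents
```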

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow