MongoDB Bulk Insert where many documents already exist

Question 1

My background is in Java, where exceptions are time-consuming, and that's the main reason I'm asking: will the "continueOnError" option be time-consuming?

The ContinueOnError flag for Bulk Inserts only affects the behaviour of the batch processing: rather than stopping processing on the first error encountered, the full batch will be processed.

In MongoDB 2.4 you will only get a single error for the batch, which will be the last error encountered. This means that if you do care about catching errors, you would be better off doing individual inserts.

The main time saving for bulk insert vs single insert is reduced network round trips. Instead of sending a message to the MongoDB server per document inserted, drivers can break down bulk inserts into batches of up to the MaxMessageSizeBytes accepted by the mongod server (currently 48MB).

Are bulk inserts appropriate for this use case?

Given your use case of only 100s (or even 1000s) of documents to insert where 80% already exist, there may not be a huge benefit in using bulk inserts (especially if this process only happens every few days). Your small inserts will be combined in batches, but 80% of the documents don't actually need to be sent to the server.

I would still favour bulk insert with ContinueOnError over your approach of deletion and re-insertion, but bulk inserts may be an unnecessary early optimisation given the number of documents you are wrangling and the percentage that actually need to be inserted.

I would suggest doing a few runs with the different approaches to see what the actual impact is for your use case.

MongoDB 2.6

As a heads-up, the batch functionality is being significantly improved in the MongoDB 2.5 development series (which will culminate in the 2.6 production release). Planned features include support for bulk upserts and accumulating per-document errors rather than a single error per batch.

The new write commands will require driver changes to support, and they may change some of the assumptions above. For example, with ContinueOnError using the new batch API, you could get back a result listing the 80% of your batch whose _ids are duplicate keys.

For more details, see the parent issue SERVER-9038 in the MongoDB issue tracker.
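
To make the per-document error reporting concrete, here is a minimal sketch of how a caller might separate expected duplicate-key failures from real problems once a batch returns one error per document. The error shape (objects with `index` and `code` fields) and the helper name `splitWriteErrors` are assumptions for illustration, not the confirmed 2.6 driver API. The only detail it relies on from this thread is that a duplicate key is reported with code 11000.

```javascript
// Hypothetical helper: not part of any MongoDB driver.
// Assumes each per-document error looks like { index: 3, code: 11000 }.
var DUPLICATE_KEY = 11000;

function splitWriteErrors(writeErrors) {
    var duplicates = [];
    var failures = [];
    writeErrors.forEach(function (e) {
        if (e.code === DUPLICATE_KEY) {
            duplicates.push(e.index);   // document already existed; safe to ignore
        } else {
            failures.push(e);           // anything else needs attention
        }
    });
    return { duplicates: duplicates, failures: failures };
}

// Example: a batch of 5 where documents 0, 1 and 3 already existed and
// document 4 failed with some other (arbitrarily chosen) error code.
var outcome = splitWriteErrors([
    { index: 0, code: 11000 },
    { index: 1, code: 11000 },
    { index: 3, code: 11000 },
    { index: 4, code: 2 }
]);
console.log(outcome.duplicates, outcome.failures.length);
```

With something like this, the caller could ignore `duplicates` and only raise an error when `failures` is non-empty.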

Question 2

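// Insert the batch; continueOnError keeps processing past individual failures.
collection.insert(item, {continueOnError: true, safe: true}, function(err, result) {
    // Code 11000 is a duplicate key error, which is expected here; rethrow anything else.
    if (err && err.code != 11000) {
        throw err;
    }

    db.close();
    callBack();
});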

Question 3

For your case, I'd suggest you consider fetching a list of the existing document _ids, and then only sending the documents that aren't in that list already. While you could use update with upsert to update individually, there's little reason to do so. Unless the list of _ids is extremely long (tens of thousands), it would be more efficient to grab the list and do the comparison than do individual updates to the database for each document (with some large percentage apparently failing to update).

I wouldn't use continueOnError and send all the documents; it's less efficient.
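
The comparison step described above can be sketched as follows. This is only an illustration: the helper name `selectMissing` and the sample data are made up, and it assumes the list of existing _ids has already been fetched (for example with a query that returns only the _id field). `String(id)` is used as the comparison key, which assumes all _ids are of one type; a numeric id and a string id with the same text would collide.

```javascript
// Hypothetical helper: keeps only documents whose _id is not already stored.
function selectMissing(docs, existingIds) {
    // Build a lookup of the ids that already exist in the collection.
    var seen = new Set(existingIds.map(function (id) {
        return String(id);
    }));
    return docs.filter(function (doc) {
        return !seen.has(String(doc._id));
    });
}

// Example: four of the five incoming documents already exist.
var existing = ["a", "b", "c", "d"];
var incoming = [
    { _id: "a" }, { _id: "b" }, { _id: "c" }, { _id: "d" },
    { _id: "e", name: "new" }
];
var toInsert = selectMissing(incoming, existing);
console.log(toInsert);
```

Only `toInsert` would then be passed to a single insert call, so the documents that already exist never leave the client.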

Question 4

I'd recommend using an upsert to let MongoDB deal with the update-or-insert logic. You can also use multi to update multiple documents that match your criteria.

From the documentation:

upsert: Optional parameter, if set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found. The syntax for this parameter depends on the MongoDB version. See Upsert Parameter.

multi: Optional parameter, if set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false. For additional information, see Multi Parameter.

db.collection.update(
   <query>,
   <update>,
   { upsert: <boolean>, multi: <boolean> }
)
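
As a rough sketch of how the placeholders above might be filled in for this use case, the helper below builds the three arguments for one upsert per document. The name `upsertArgs` and the sample document are hypothetical. It assumes every document carries its own _id and should fully replace any stored copy, and it leaves multi at false because a replacement-style update cannot be combined with multi.

```javascript
// Hypothetical helper: builds the <query>, <update> and options arguments.
function upsertArgs(doc) {
    return [
        { _id: doc._id },                 // <query>: match on the primary key
        doc,                              // <update>: replacement document
        { upsert: true, multi: false }    // insert when nothing matches
    ];
}

var args = upsertArgs({ _id: 42, name: "example" });
console.log(JSON.stringify(args));
// In the shell, the call would then be:
// db.collection.update(args[0], args[1], args[2]);
```

Note that this issues one update per document, so it trades the round-trip savings of a bulk insert for simpler logic.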

Here is the referenced documentation: http://docs.mongodb.org/manual/reference/method/db.collection.update/