Question

I am relatively new to MongoDB, but we are considering using it as a kind of cache in front of a legacy service. Along the way, we have stumbled across some issues.

First, some explanation.

This caching service will sit between a legacy service and clients. Clients will connect to the caching service, which gets its data from the legacy service. The caching service fetches data every X minutes and keeps it in MongoDB. The schema is as simple as it can get: just a document with lots of key/value pairs. No nested documents or anything like that. In addition, we set the _id to a unique ID from the legacy service, so we have control over that as well.

When the caching service fetches data from the legacy service, it gets just a delta (only changes since last fetch). So, if 5 "objects" have changed since last time, you get just those 5 "objects" (but you get the complete object, not a delta of the object). If any new "objects" have been added to the legacy service, those are of course also in the delta.

Our "problem"

In my mind, this sounds like an upsert. If there are new objects, insert them. If there are changes to existing objects, update them. However, MongoDB does not seem to be particularly fond of multiple upserts. Just inserting gives me an error about duplicate keys, which is perfectly understandable since a document already exists with the same _id. The update function, which can take an upsert parameter, cannot take a list of new objects. It seems to me that doing this in a single query is not possible. There is, though, the possibility that I have completely overlooked something here.
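For illustration, a single-document upsert with the Java driver looks roughly like the sketch below (a sketch only; coll, id and fields are placeholders, and the exact options class depends on the driver version):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import org.bson.Document;

// Upsert a single object: insert it if no document with this _id exists,
// otherwise update the fields of the existing document.
public static void upsertOne(MongoCollection<Document> coll, Object id, Document fields) {
    // fields holds the key/values to store (assumed here not to include _id itself)
    coll.updateOne(Filters.eq("_id", id),
                   new Document("$set", fields),
                   new UpdateOptions().upsert(true));
}

This works for one object per call, which is exactly the limitation described above.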

Possible solutions

There are a number of possible solutions, and two in particular come to mind:

  • Do two queries: first, build a list of all the _id's (remember, we get these from our legacy service). Then, delete the matching documents using the $in operator together with the _id list, and immediately insert the new documents. In practice this would update our collection with the new data, and it is easy to implement (see the sketch after this list). A problem that might occur is that a client asks for data between the delete and the insert, and therefore wrongly gets an empty result. This is a deal breaker and absolutely cannot happen.
  • Do one upsert per changed object. Also quite easy to implement, and it should not suffer from the same problem as the other solution. It has other (maybe imaginary) problems, though. How many upserts can MongoDB handle in a short amount of time? Could it comfortably handle 5000 upserts every minute? These are not big documents, just about 20 key/value pairs and no subdocuments. That number is pulled out of thin air; it is quite hard to predict actual volumes. In my mind, this approach feels wrong. I cannot understand why it should be necessary to run one query per new document.
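For the first option, a minimal sketch of the two statements involved might look like this (assuming a MongoCollection<Document> named coll and the delta as a list of full documents; the window between the two calls is exactly where a client could see missing data):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.util.List;
import java.util.stream.Collectors;

// Option 1: remove the changed documents, then re-insert the new versions.
// A read arriving between deleteMany and insertMany will not see these documents.
public static void deleteThenInsert(MongoCollection<Document> coll, List<Document> delta) {
    List<Object> ids = delta.stream()
                            .map(d -> d.get("_id"))
                            .collect(Collectors.toList());
    coll.deleteMany(Filters.in("_id", ids));
    coll.insertMany(delta);
}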

Any help would be much appreciated, both regarding the two proposed solutions and any other ideas. As a side note, the technology choice is not up for discussion, so please do not suggest other kinds of databases or languages. There are other, strong reasons why we have chosen what we have chosen :)

Solution

I'll share my experience ...

At my last job we had a similar situation. We ended up doing one query/write per document/object. We used Mule ESB to pump data from the legacy system to Mongo and each write was an upsert.

The performance was pretty good, but not great. We could get several thousand documents into Mongo in a few minutes. The documents were fairly rich, so that might have been part of why we had to throttle the writes to Mongo.

After we bulk loaded data, the "real time" performance was never an issue.

The first option you suggested sounds too complex and potentially leaves Mongo in an unknown state in case the operation dies halfway through an update. The upsert option saved us many times because we could replay inserts over and over and be safe.

OTHER TIPS

To expand on ryan1234's answer:

The 2.6 version of MongoDB will have the ability to send batched updates. For now you will need to submit separate requests for each document.

As ryan1234 said, doing an upsert per document is the only safe way to update all of the existing documents and add the new ones, if the legacy provider does not tell you which objects are new. A single MongoDB process can easily handle thousands of updates per second(1) on mid-tier hardware. If you are not getting that level of performance, then it is probably the latency of requests between the client and the MongoDB server. The Asynchronous Java Driver can help overcome that limitation by allowing multiple update requests to be in-flight to the server at the same time with minimal client-side complexity/threading.

HTH, Rob

1: I assume the documents are not growing and there are no index updates, but even with those you should be able to approach a thousand updates a second.
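To make the per-document upsert concrete, a loop along the following lines would do it (a sketch only, using the standard synchronous Java driver (3.7+) rather than the Asynchronous Java Driver; with replaceOne the stored document is overwritten wholesale, so replaying the same delta is harmless):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

import java.util.List;

// Apply a delta from the legacy service: one upsert per complete object, keyed by _id.
// Replaying the same delta is safe because each call simply overwrites the same document.
public static void applyDelta(MongoCollection<Document> coll, List<Document> delta) {
    ReplaceOptions upsert = new ReplaceOptions().upsert(true);
    for (Document obj : delta) {
        coll.replaceOne(Filters.eq("_id", obj.get("_id")), obj, upsert);
    }
}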

I know, I had to really dig deep to find the right way to do it. Try this (imports are for the MongoDB 3.x Java driver):

import java.util.ArrayList;
import java.util.List;

import org.bson.Document;

import com.mongodb.BasicDBObject;
import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.UpdateOptions;

/**
 * Insert all items in docs into the collection.
 * @param coll the target collection
 * @param docs the new or updated documents
 * @param keyTag the name of the key in the document
 * @param upsert if true creates a new document if not found
 * @return BulkWriteResult or null if docs.isEmpty()
 */
public static BulkWriteResult insertAll(MongoCollection<Document> coll, List<Document> docs, String keyTag, boolean upsert) {
    if (docs.isEmpty())
        return null;
    List<UpdateOneModel<Document>> requests = new ArrayList<>(docs.size());
    UpdateOptions opt = new UpdateOptions().upsert(upsert);
    for (Document doc : docs) {
        // Match on the key field and overwrite the stored fields with $set.
        BasicDBObject filter = new BasicDBObject(keyTag, doc.get(keyTag));
        BasicDBObject action = new BasicDBObject("$set", doc);
        requests.add(new UpdateOneModel<Document>(filter, action, opt));
    }
    // One round trip to the server for the whole batch.
    return coll.bulkWrite(requests);
}

Or, in case your keys are compound, you could use:

public static BulkWriteResult insertAll(MongoCollection<Document> coll, List<Document> docs, String[] keyTags, boolean upsert) {
    if (docs.isEmpty())
        return null;
    List<UpdateOneModel<Document>> requests = new ArrayList<>(docs.size());
    UpdateOptions opt = new UpdateOptions().upsert(upsert);
    for (Document doc : docs) {
        // Build a filter that matches on every key field.
        BasicDBObject filter = new BasicDBObject();
        for (String keyTag : keyTags) {
            filter.append(keyTag, doc.get(keyTag));
        }
        BasicDBObject action = new BasicDBObject("$set", doc);
        requests.add(new UpdateOneModel<Document>(filter, action, opt));
    }
    return coll.bulkWrite(requests);
}
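For the question's scenario, where documents are keyed by the legacy ID stored in _id, a call might look like this (collection and cachedDelta are placeholders for the target collection and the fetched delta):

// Upsert every document in the delta, matching on _id.
// Assumes each document's _id equals the filter value built from keyTag ("_id"),
// so $set does not try to change it.
BulkWriteResult result = insertAll(collection, cachedDelta, "_id", true);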
Licensed under: CC-BY-SA with attribution