I am relatively new to MongoDB, but we are considering using it as a sort of cache in front of a legacy service. While looking into this, we have stumbled across some issues.
First, some explanation.
This caching service will sit between a legacy service and its clients. Clients will connect to the caching service, which gets its data from the legacy service. The caching service fetches data every X minutes and keeps it in MongoDB. The schema is as simple as it can get: just a flat document with lots of keys/values, no nested documents or anything like that. In addition, we set the _id to a unique ID from the legacy service, so we have control over that as well.
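For illustration, one cached document might look roughly like this (the field names here are invented; the real documents are simply ~20 flat key/value pairs):

```python
# Hypothetical example of one cached document; field names are made up.
doc = {
    "_id": "legacy-4711",        # the unique ID taken from the legacy service
    "name": "Some object",
    "status": "active",
    "last_modified": "2012-06-01T10:00:00Z",
    # ... about 20 flat key/value pairs in total, no nested documents
}
```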
When the caching service fetches data from the legacy service, it gets just a delta (only changes since the last fetch). So, if 5 "objects" have changed since last time, you get just those 5 "objects" (but you get each complete object, not a delta of the object). If any new "objects" have been added to the legacy service, those are of course also included in the delta.
Our "problem"
In my mind, this sounds like an upsert: if there are new objects, insert them; if there are changes to existing objects, update them. However, MongoDB does not seem to be particularly fond of multiple upserts in one go. Just inserting gives me an error about duplicate keys, which is perfectly understandable since a document with the same _id already exists. The update function, which can take an upsert parameter, cannot take a list of new objects. It seems to me that a single query is not possible, though I may well have completely overlooked something here.
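To make this concrete, here is a minimal sketch of both attempts, assuming Python with pymongo purely for illustration (our actual driver may differ; the database/collection names and the sample data are made up):

```python
from pymongo import MongoClient
from pymongo.errors import BulkWriteError

client = MongoClient()                 # assumes a local mongod on the default port
coll = client["cache"]["objects"]      # hypothetical db/collection names

# A made-up delta: complete objects from the legacy service, _id already set.
delta = [
    {"_id": "legacy-4711", "name": "Some object", "status": "active"},
    {"_id": "legacy-4712", "name": "Another object", "status": "inactive"},
]

# Attempt 1: a plain bulk insert. This fails with a duplicate key error
# as soon as one of the objects already exists in the collection.
try:
    coll.insert_many(delta)
except BulkWriteError:
    pass  # E11000 duplicate key error on the already-cached _ids

# Attempt 2: an upsert. This works, but only for a single filter/document
# pair, not for a whole list of new objects.
coll.replace_one({"_id": delta[0]["_id"]}, delta[0], upsert=True)
```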
Possible solutions
There are a number of different solutions, and two in particular come to my mind:
- Do two queries: first, compute a list of all the _ids (remember, we get these from the legacy service); then delete the matching documents using the $in operator together with that _id list, and immediately insert the new documents. In practice, this updates our collection with the new data, and it is easy to implement (see the first sketch after this list). A problem that might occur is that a client asks for data between the delete and the insert and therefore wrongly gets an empty result. That is a deal breaker and absolutely cannot happen.
- Do one upsert per changed object (see the second sketch below). This is also quite easy to implement and should not suffer from the problem above. It has other (maybe imaginary) problems, though: how many upserts can MongoDB handle in a short amount of time? Could it comfortably handle 5000 upserts every minute? These are not big documents, just about 20 key/values and no subdocuments. That number is pulled out of thin air; it is quite hard to predict the actual load. In my mind, this approach feels wrong: I cannot understand why it should be necessary to run one query per new document.
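For concreteness, here is a minimal sketch of the first solution, reusing the hypothetical coll and delta from the sketch above. The gap between the delete and the insert is exactly the window I am worried about:

```python
# Collect the _ids of all changed objects (we get these from the legacy service).
ids = [obj["_id"] for obj in delta]

# Step 1: delete the old versions of the changed objects.
coll.delete_many({"_id": {"$in": ids}})

# <-- a client reading right here would wrongly see the objects as missing

# Step 2: immediately insert the fresh versions.
coll.insert_many(delta)
```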
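And a sketch of the second solution: one upsert per changed object, using a full-document replace with upsert=True, since we always receive the complete object from the legacy service:

```python
# One round trip to MongoDB per changed object; with ~5000 changed
# objects per minute this means ~5000 individual upserts.
for obj in delta:
    coll.replace_one({"_id": obj["_id"]}, obj, upsert=True)
```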
Any help would be much appreciated, both regarding the two proposed solutions and any alternatives. As a side note, the choice of technology is not really up for discussion, so please do not suggest other kinds of databases or languages. There are other, strong reasons why we have chosen what we have chosen :)