First, let's take a moment to consider what GridFS actually is. As a starter, let's read from the manual page that is referenced:
GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB.
So with that out of the way: that may well be your use case, but the lesson to learn here is that GridFS is not automatically the "go-to" method for storing files.
What has happened here in your case (and others) is that, because GridFS is a "driver level" specification (MongoDB itself does no magic here), your "files" have been "split" across two collections: one collection for the main reference to the content, and the other for the "chunks" of data.
Your problem (and that of others) is that you have managed to leave behind the "chunks" now that the "main" reference has been removed. So with a large number of them, how do you get rid of the orphans?
Your current reading says "loop and compare", and since MongoDB does not do joins, there really is no other answer. But there are some things that can help.

So rather than running a huge $nin, try doing a few different things to break this up. Consider working in the reverse order, for example:
db.fs.chunks.aggregate([
    { "$group": { "_id": "$files_id" } },
    // sort the distinct ids so the batch boundary is predictable
    { "$sort": { "_id": 1 } },
    { "$limit": 5000 }
])
So what you are doing there is getting the distinct "files_id" values (being the references to fs.files) from all of the entries, for 5000 of your entries to start with. Then of course you're back to looping, checking fs.files for a matching _id. If something is not found, then remove the documents matching that "files_id" from your "chunks".
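In the mongo shell, that check-and-remove pass for one batch might look something like the following sketch. This assumes MongoDB 2.6 or later (where aggregate() returns a cursor you can iterate) and the default "fs" collection prefix; last_id is just an illustrative variable name:

var last_id = null;

db.fs.chunks.aggregate([
    { "$group": { "_id": "$files_id" } },
    { "$sort": { "_id": 1 } },
    { "$limit": 5000 }
]).forEach(function(doc) {
    // remember where this batch ended so the next pass can continue from it
    last_id = doc._id;

    // no matching fs.files document means these chunks are orphaned
    if ( db.fs.files.findOne({ "_id": doc._id }) == null ) {
        db.fs.chunks.remove({ "files_id": doc._id });
    }
});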
But that was only 5000, so keep the last id found in that set, because now you are going to run the same aggregate statement again, but differently:
db.fs.chunks.aggregate([
    // continue from the last _id seen in the previous batch
    { "$match": { "files_id": { "$gte": last_id } } },
    { "$group": { "_id": "$files_id" } },
    { "$sort": { "_id": 1 } },
    { "$limit": 5000 }
])
So this works because ObjectId values are monotonic, or "ever increasing": all new entries are always greater than the last. Then you can go and loop those values again and do the same deletes where not found.
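Putting those pieces together, a complete pass over all of the orphans might look something like this sketch in the shell. Again this assumes MongoDB 2.6 or later and the default "fs" prefix, and names like batchSize are purely illustrative; $gt is used here simply to avoid re-checking the id that ended the previous batch:

var batchSize = 5000;
var last_id = null;
var processed = batchSize;

while ( processed == batchSize ) {
    var pipeline = [];

    // after the first batch, continue from just past the last _id seen
    if ( last_id != null )
        pipeline.push({ "$match": { "files_id": { "$gt": last_id } } });

    pipeline.push(
        { "$group": { "_id": "$files_id" } },
        { "$sort": { "_id": 1 } },
        { "$limit": batchSize }
    );

    processed = 0;
    db.fs.chunks.aggregate(pipeline).forEach(function(doc) {
        processed++;
        last_id = doc._id;

        // no matching fs.files document means these chunks are orphans
        if ( db.fs.files.findOne({ "_id": doc._id }) == null ) {
            db.fs.chunks.remove({ "files_id": doc._id });
        }
    });
}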
Will this "take forever"? Well, yes. You might employ db.eval() for this, but read the documentation first. Overall, though, this is the price you pay for using two collections.
Back to the start. The GridFS spec is designed this way because it specifically wants to work around the 16MB limitation. But if that is not your limitation, then question why you are using GridFS in the first place.
MongoDB has no problem storing "binary" data within any element of a given BSON document, so you do not need to use GridFS just to store files. If you had stored them that way instead, then all of your updates would be completely "atomic", as they act on only one document in one collection at a time.
Since GridFS deliberately splits documents across collections, if you use it then you live with the pain. So use it if you need it; but if you do not, then just store the BinData as a normal field and these problems go away.
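As a rough illustration only (the collection and field names here are hypothetical), storing a small file directly in an ordinary document can be as simple as this, as long as the content stays well under the 16MB BSON limit:

db.attachments.insert({
    "filename": "avatar.png",
    "contentType": "image/png",
    // the base64 payload is a truncated placeholder for the real bytes
    "data": BinData(0, "iVBORw0KGgoAAAANSUhEUgAAAAEAAAAB")
})

Any update that touches that document is then a single-document, single-collection operation, which is exactly the atomicity mentioned above.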
But at least you have a better approach to take than loading everything into memory.