Question

We have a MongoDB cluster using GridFS. The fs.chunks collection is sharded across two replica sets, and disk usage is very high: for 90 GB of data we need more than 130 GB of disk space.

It seems the fs.chunks collection is what consumes the space. Summing the "length" field of fs.files gives the 90 GB. Summing the "size" field across both shards gives 130 GB. This is the real size of the payload data contained in the collection, right?

This means it has 40 GB of overhead? Is this correct? Where is it coming from? Is it the BSON encoding? Is there a way to reduce it?

mongos> db.fs.chunks.stats()
{
    "sharded" : true,
    "ns" : "ub_datastore_preview.fs.chunks",
    "count" : 1012180,
    "numExtents" : 106,
    "size" : 140515231376,
    "storageSize" : 144448592944,
    "totalIndexSize" : 99869840,
    "indexSizes" : {
            "_id_" : 43103872,
            "files_id_1_n_1" : 56765968
    },
    "avgObjSize" : 138824.35078345748,
    "nindexes" : 2,
    "nchunks" : 2400,
    "shards" : {
            "ub_datastore_qa_group1" : {
                    "ns" : "ub_datastore_preview.fs.chunks",
                    "count" : 554087,
                    "size" : 69448405120,
                    "avgObjSize" : 125338.44887174758,
                    "storageSize" : 71364832800,
                    "numExtents" : 52,
                    "nindexes" : 2,
                    "lastExtentSize" : 2146426864,
                    "paddingFactor" : 1,
                    "systemFlags" : 1,
                    "userFlags" : 0,
                    "totalIndexSize" : 55269760,
                    "indexSizes" : {
                            "_id_" : 23808512,
                            "files_id_1_n_1" : 31461248
                    },
                    "ok" : 1
            },
            "ub_datastore_qa_group2" : {
                    "ns" : "ub_datastore_preview.fs.chunks",
                    "count" : 458093,
                    "size" : 71066826256,
                    "avgObjSize" : 155136.2414531547,
                    "storageSize" : 73083760144,
                    "numExtents" : 54,
                    "nindexes" : 2,
                    "lastExtentSize" : 2146426864,
                    "paddingFactor" : 1,
                    "systemFlags" : 1,
                    "userFlags" : 0,
                    "totalIndexSize" : 44600080,
                    "indexSizes" : {
                            "_id_" : 19295360,
                            "files_id_1_n_1" : 25304720
                    },
                    "ok" : 1
            }
    },
    "ok" : 1
}

Solution 2

The problem turned out to be "orphaned chunks" in GridFS. GridFS writes the chunks first and the metadata afterwards; if something goes wrong in between, the already-written chunks remain behind as orphans and have to be cleaned up manually.
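A minimal sketch of the cleanup idea, in plain JavaScript with in-memory arrays standing in for the fs.files and fs.chunks collections (the sample documents and ids are hypothetical; in a real deployment you would run the equivalent queries through the mongo shell or a driver, and verify carefully before deleting anything):

```javascript
// Simulated GridFS collections: fs.files holds metadata,
// fs.chunks holds the payload, linked by files_id.
const fsFiles = [
  { _id: "f1", length: 1024 },
  { _id: "f2", length: 2048 },
];
const fsChunks = [
  { files_id: "f1", n: 0 },
  { files_id: "f2", n: 0 },
  { files_id: "f3", n: 0 }, // no matching fs.files entry -> orphaned
];

// Collect the ids of all files that have metadata.
const knownIds = new Set(fsFiles.map((f) => f._id));

// Any chunk whose files_id has no metadata entry is an orphan
// and can be deleted to reclaim space.
const orphans = fsChunks.filter((c) => !knownIds.has(c.files_id));

console.log(orphans.map((c) => c.files_id)); // -> [ 'f3' ]
```

On a sharded collection of this size you would batch the id lookups rather than loading everything into memory, but the matching logic is the same.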

OTHER TIPS

This is the real size of the payload data contained in the collection, right?

Yes.

This means it has 40 GB of overhead? Is this correct?

Kinda. But it seems unusually large.

Where is it coming from? Is it the BSON encoding?

No, the BSON encoding itself doesn't add that much overhead, but the associated metadata sometimes does.

In MongoDB the main source of overhead is usually metadata, but if you use the reference GridFS spec, it shouldn't be this large.

For example, in our storage we have:

db.fs.files.aggregate([{$group: {_id: null, total: { $sum: "$length"}}}])
{
    "result" : [
        {
            "_id" : null,
            "total" : NumberLong("4631125908060")
        }
    ],
    "ok" : 1
}

And

db.fs.chunks.stats()
{
    "ns" : "grid_fs.fs.chunks",
    "count" : 26538434,
    "size" : NumberLong("4980751887148"),
    "avgObjSize" : 187680.70064526037,
    "storageSize" : NumberLong("4981961457440"),
    "numExtents" : 2342,
    "nindexes" : 2,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 2405207504,
    "indexSizes" : {
        "_id_" : 1024109408,
        "files_id_1_n_1" : 1381098096
    },
    "ok" : 1
}

So, about 350 GB of overhead on roughly 4.6 TB of payload, i.e. around 7.5%.

You stored 90 GB of data, but it consumed 130 GB of disk space.

That means roughly 44% overhead.
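The arithmetic behind that figure, for reference (the two inputs are the numbers from the question):

```javascript
const payloadGB = 90;  // sum of the "length" field in fs.files
const storedGB = 130;  // sum of the "size" field across both shards

// Overhead relative to the payload actually stored.
const overheadPct = ((storedGB - payloadGB) / payloadGB) * 100;

console.log(overheadPct.toFixed(1) + "%"); // -> 44.4%
```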

As stated in this blog post, the storage overhead of GridFS is approximately 45%, which is almost the same as in your case.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow