Question

I sharded my MongoDB cluster by hashed _id. When I checked the index sizes, I found an _id_hashed index that takes up a lot of space:

   "indexSizes" : {
           "_id_" : 14060169088,
           "_id_hashed" : 9549780576
    },

The MongoDB manual says that an index on the shard key is created when you shard a collection. I guess that is why the _id_hashed index exists.

My question is: what is the _id_hashed index for if I only query documents by the _id field? Can I delete it, given how much space it takes?

PS: it seems MongoDB uses the _id index when querying, not the _id_hashed index. Here is the execution plan for a query:

   "clusteredType" : "ParallelSort",
    "shards" : {
            "rs1/192.168.62.168:27017,192.168.62.181:27017" : [
                    {
                            "cursor" : "BtreeCursor _id_",
                            "isMultiKey" : false,
                            "n" : 0,
                            "nscannedObjects" : 0,
                            "nscanned" : 1,
                            "nscannedObjectsAllPlans" : 0,
                            "nscannedAllPlans" : 1,
                            "scanAndOrder" : false,
                            "indexOnly" : false,
                            "nYields" : 0,
                            "nChunkSkips" : 0,
                            "millis" : 0,
                            "indexBounds" : {
                                    "start" : {
                                            "_id" : "spiderman_task_captainStatus_30491467_2387600"
                                    },
                                    "end" : {
                                            "_id" : "spiderman_task_captainStatus_30491467_2387600"
                                    }
                            },
                            "server" : "localhost:27017"
                    }
            ]
    },
    "cursor" : "BtreeCursor _id_",
    "n" : 0,
    "nChunkSkips" : 0,
    "nYields" : 0,
    "nscanned" : 1,
    "nscannedAllPlans" : 1,
    "nscannedObjects" : 0,
    "nscannedObjectsAllPlans" : 0,
    "millisShardTotal" : 0,
    "millisShardAvg" : 0,
    "numQueries" : 1,
    "numShards" : 1,
    "indexBounds" : {
            "start" : {
                    "_id" : "spiderman_task_captainStatus_30491467_2387600"
            },
            "end" : {
                    "_id" : "spiderman_task_captainStatus_30491467_2387600"
            }
    },
    "millis" : 574

Solution

MongoDB uses a range-based sharding approach. If you choose hash-based sharding, you must have a hashed index on the shard key, and you cannot drop it, because it is used to determine which shard to use for any subsequent queries. (Note that there is an open ticket, SERVER-8031, to allow you to drop the _id index once hashed indexes are allowed to be unique.)
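To make the "hashed sharding is still range based" point concrete, here is a minimal sketch of routing by hash range. It is only an illustration: `toyHash`, the chunk table, and the shard names `rs2` to `rs4` are assumptions made up for the example, and the hash is not MongoDB's actual hash function.

```javascript
const crypto = require("crypto");

// Illustrative hash: first 4 bytes of MD5 as an unsigned 32-bit integer.
// Assumption for this sketch only, not MongoDB's real hash function.
function toyHash(key) {
  return crypto.createHash("md5").update(String(key)).digest().readUInt32BE(0);
}

// Hypothetical chunk table: each chunk owns a half-open range [min, max)
// of hash values and lives on one shard.
const chunks = [
  { min: 0,          max: 0x40000000,  shard: "rs1" },
  { min: 0x40000000, max: 0x80000000,  shard: "rs2" },
  { min: 0x80000000, max: 0xc0000000,  shard: "rs3" },
  { min: 0xc0000000, max: 0x100000000, shard: "rs4" },
];

// Route a document by finding the chunk whose range contains hash(_id).
function routeToShard(id) {
  const h = toyHash(id);
  return chunks.find((c) => h >= c.min && h < c.max).shard;
}

console.log(routeToShard("spiderman_task_captainStatus_30491467_2387600"));
```

The lookup is an ordinary range lookup; the only difference from plain range sharding is that the ranges are over hash values rather than over the raw _id.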

As for why the query appears to use the _id index rather than the _id_hashed index: I ran some tests, and I think the optimizer chooses the _id index because it is unique and results in a more efficient plan. You can see similar behavior if you shard on another key that has a pre-existing unique index.

Other tips

If you sharded on a hashed _id then that's the type of index that was created.

When you did sh.shardCollection( 'db.collection', { _id:"hashed" } ) you told it you wanted to use a hash of _id as the shard key which requires a hashed index on _id.

So, no, you cannot drop it.

The documentation explains in detail what a hashed index is, so I am puzzled that you have read it but don't know what the hashed index is for.

The index exists mainly to prevent hot spots with shard keys whose reads and writes are not evenly distributed.

Take the _id field: it is an ever-increasing range, so every new _id sorts after all existing ones. This means that you are always writing at the end of your cluster, creating a hot spot.

As for reads, it is quite common to read only the newest documents. In that case only the upper range of the _id key is used, which makes the upper range of the cluster a hot spot for both reads and writes while the rest of your cluster sits idle.

The hashed index takes this bad shard key and hashes it so that it is no longer ever-increasing. Instead it produces an evenly distributed set of data for reads and writes, hopefully causing the entire cluster to be utilised for operations.
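A small simulation can illustrate the hot spot. This is only a sketch under stated assumptions: the keys are plain increasing integers standing in for _id, there are four hypothetical shards, the range boundaries were fixed when only keys 0..999 existed, and `toyHash` is an illustrative hash rather than MongoDB's own.

```javascript
const crypto = require("crypto");

// Illustrative hash (assumption for this sketch, not MongoDB's real hash function).
function toyHash(key) {
  return crypto.createHash("md5").update(String(key)).digest().readUInt32BE(0);
}

const SHARDS = 4;

// Range sharding on the raw key: boundaries were chosen from existing keys 0..999,
// so every key >= 750 falls into the last chunk.
function rangeShard(key) {
  return Math.min(Math.floor(key / 250), SHARDS - 1);
}

// Hashed sharding: the 32-bit hash space is split into equal ranges.
function hashedShard(key) {
  return Math.floor(toyHash(key) / (0x100000000 / SHARDS));
}

// Count how many of the newest inserts (keys 1000..1999) land on each shard.
function distribution(shardFn) {
  const counts = new Array(SHARDS).fill(0);
  for (let key = 1000; key < 2000; key++) counts[shardFn(key)]++;
  return counts;
}

const rangeCounts = distribution(rangeShard);
const hashedCounts = distribution(hashedShard);
console.log("range :", rangeCounts);
console.log("hashed:", hashedCounts);
```

With range sharding all 1000 new writes hit the last shard, while the hashed variant spreads them across all four.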

I would strongly recommend you do not delete it.

A hashed index is required by a sharded collection. More exactly, it is required by the sharding balancer, which uses it to find documents directly by hash value. Normal query operations do not require the index they use to be a hashed index, even on a sharded collection.
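As a rough illustration of why something like the balancer needs an index ordered by hash value, the sketch below keeps a toy "_id_hashed" index sorted by hash so that all documents in one chunk's hash range form a contiguous slice. Everything here (`toyHash`, `buildHashedIndex`, `idsInChunk`, the `task_N` ids, the chunk bounds) is hypothetical and not MongoDB's actual implementation.

```javascript
const crypto = require("crypto");

// Illustrative hash (assumption for this sketch, not MongoDB's real hash function).
function toyHash(key) {
  return crypto.createHash("md5").update(String(key)).digest().readUInt32BE(0);
}

// Toy "_id_hashed" index: one entry per document, sorted by hash value.
function buildHashedIndex(ids) {
  return ids
    .map((id) => ({ hash: toyHash(id), id }))
    .sort((a, b) => a.hash - b.hash);
}

// Binary search: first position whose hash is >= target.
function lowerBound(index, target) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (index[mid].hash < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// All ids whose hash lies in the chunk range [min, max), read as one contiguous slice.
function idsInChunk(index, min, max) {
  return index.slice(lowerBound(index, min), lowerBound(index, max)).map((e) => e.id);
}

const ids = Array.from({ length: 1000 }, (_, i) => `task_${i}`);
const hashedIndex = buildHashedIndex(ids);
const chunkIds = idsInChunk(hashedIndex, 0x40000000, 0x80000000);
console.log(chunkIds.length, "documents fall in the chunk");
```

An index ordered by the raw _id cannot answer this "everything in a hash range" question without scanning every entry, whereas a lookup of a single _id works fine against the regular _id index.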

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow