Question

I am creating a MongoDB collection that stores JSON objects, and I am stuck on the sharding part. Each record in the collection has a Case ID, a Customer ID and a location.

The Case ID is a 10-digit number (digits only, no letters).

The CustomerID is a combination of customer name and case ID.

The location is a 2dsphere value, and I expect it to take many distinct values.

In addition to this, each record has a customer name and a case description. All my search queries filter on either Case ID, CustomerID or location.

Given this scenario, can I create a compound shard key based on all three values (CaseID, CustomerID and location)? I believe this would give high cardinality and make the records easy to retrieve.

Could anyone tell me whether this is a good approach? I have not found any example of a compound shard key comprising three values.

Thanks for your time, and let me know if you need any more information.


Solution

The first thing to consider is whether it's necessary to shard. If your data set fits on a single server, then start out with an unsharded deployment. It's easy and seamless to convert this to a sharded cluster later on if needed.

Assuming you do indeed need to shard, your choice of shard key should be based on the following criteria:

  1. Cardinality - choose a shard key that is not limited to a small number of possible values, so that MongoDB can evenly distribute data among the shards in your cluster.
  2. Write distribution - choose a shard key that evenly distributes write operations among shards in the cluster, to prevent any single shard from becoming a bottleneck.
  3. Query isolation - choose a shard key that is included in your most frequent queries, so that those queries may be efficiently routed to a single target shard that holds the data, as opposed to being broadcast to all shards.

You mention that all your queries contain either Case ID, Customer ID or location, but haven't described your use cases. By way of an example let's suppose your most frequent queries are to:

  • retrieve a customer case
  • retrieve all cases for a given customer

In that case, a good shard key candidate would be a compound shard key on (name, caseID), in that order, where name is the customer name, together with a corresponding compound index. Consider whether this satisfies the above criteria:

  1. Cardinality - each document has a different value for the shard key so cardinality is excellent.
  2. Write distribution - cases for all customers are distributed across all shards.
  3. Query isolation:
    • To retrieve a specific case, name and caseID should be included in the query. This query will be routed to the specific shard that holds the document.
    • To retrieve all cases for a given customer, include name in the query. Because this query includes a prefix of the shard key, it will also be routed efficiently, only to the specific shard(s) that hold documents matching the query.
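The routing behaviour described above can be sketched in a few lines of Python. This is only a simplified model, not how mongos is implemented: the namespace "support.cases" and the helper names are hypothetical, and the routing rule is reduced to "does the query constrain the first field of the shard key?". The command document follows the shape of MongoDB's shardCollection admin command.

```python
# Hypothetical namespace and field names, chosen for illustration only.
NAMESPACE = "support.cases"
SHARD_KEY = {"name": 1, "caseID": 1}  # compound key: customer name, then case ID


def shard_collection_command(namespace, key):
    """Build the document for the shardCollection admin command.

    With a driver such as pymongo, a document like this could be passed to
    client.admin.command(...) on a cluster where sharding is already set up.
    """
    return {"shardCollection": namespace, "key": key}


def routing(query, shard_key=SHARD_KEY):
    """Simplified model of query routing.

    A query that constrains the leading field of the shard key can be
    targeted at specific shard(s); any other query is broadcast to all shards.
    """
    first_field = next(iter(shard_key))
    return "targeted" if first_field in query else "broadcast"


if __name__ == "__main__":
    print(shard_collection_command(NAMESPACE, SHARD_KEY))
    print(routing({"name": "Acme", "caseID": 1234567890}))  # full shard key
    print(routing({"name": "Acme"}))                        # shard key prefix
    print(routing({"caseID": 1234567890}))                  # not a prefix
```

Note that a query on caseID alone is not a prefix of (name, caseID), so under this key it would be broadcast; if lookups by case ID alone are frequent, that is worth weighing when choosing the field order.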

Note that you cannot use a geospatial index as part of a shard key index (as documented here). However, you can still create and use a geospatial index on a sharded collection if using some other fields for the shard key. So for example, with the above shard key:

  • a geospatial query that also includes customer name will be targeted at the relevant shard(s).
  • a geospatial query that doesn't include customer name will be broadcast to all shards (a 'scatter/gather' query).
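As a rough illustration of the two cases above, the sketch below builds $geoWithin / $centerSphere filter documents with and without the customer name. The field names (name, location), the helper names and the sample coordinates are assumptions made for this example, and the targeting check is a simplification of what mongos actually does.

```python
# Assumed shard key fields, matching the (name, caseID) example above.
SHARD_KEY_FIELDS = ("name", "caseID")

# Approximate equatorial radius of the Earth, used to convert km to radians.
EARTH_RADIUS_KM = 6378.1


def geo_within_query(lon, lat, radius_km, name=None):
    """Build a filter for documents whose location lies within radius_km
    of the point (lon, lat). $centerSphere expects the radius in radians."""
    query = {
        "location": {
            "$geoWithin": {
                "$centerSphere": [[lon, lat], radius_km / EARTH_RADIUS_KM]
            }
        }
    }
    if name is not None:
        # Adding the leading shard key field lets the query be targeted.
        query["name"] = name
    return query


def is_targeted(query):
    """Simplified check: targeted if the first shard key field is constrained."""
    return SHARD_KEY_FIELDS[0] in query


if __name__ == "__main__":
    with_name = geo_within_query(-73.97, 40.77, 5, name="Acme")
    without_name = geo_within_query(-73.97, 40.77, 5)
    print(is_targeted(with_name))     # query includes customer name
    print(is_targeted(without_name))  # scatter/gather across all shards
```

The 2dsphere index on location would be created separately from the shard key index, since the geospatial field is not part of the shard key.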

Additional documentation on shard key considerations can be found here.

Licensed under: CC-BY-SA with attribution