Is it better to have one collection with a billion or one thousand with one million objects?

StackOverflow https://stackoverflow.com/questions/22698927

  •  22-06-2023

Question

How much will performance differ between one NoSQL database (MongoDB) containing a single collection, logs, with 1 billion entries and one containing one thousand collections (logs_source0, logs_source1, ...) of one million entries each? Will this change if the data is sharded across multiple servers? Objects contain between 6 and 10 keys and sometimes one array of 3-5 objects. The application can be designed around either layout, as _sourceX can easily be turned into an extra key or vice versa.


Solution

As long as all that data is on a single server, having a single big collection or many small ones should not make much of a difference. As with any performance question, a thorough answer would have to take your intended usage of that data into account. Are you frequently accessing all of that data? Or do you have a comparatively small working set that is frequently accessed, while the rest is very rarely looked at?

Having many small collections could be better when it comes to selectively paging some of that data into memory. A single big collection can, of course, also be paged into memory selectively, but its indexes should fit entirely within memory if at all possible, to ensure quick access to the data. With many smaller collections, that would be easier, since each collection would have its own small indexes and only those of the collections in use need to be in memory.
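To make the index argument concrete, here is a back-of-envelope sketch in Python. The document and source counts come from the question; the bytes-per-index-entry figure and the number of active sources are purely hypothetical assumptions, so the real numbers would have to be measured on actual data.

```python
# Rough index sizing for the two layouts.
# TOTAL_DOCS and NUM_SOURCES come from the question; everything else is assumed.
TOTAL_DOCS = 1_000_000_000
NUM_SOURCES = 1_000
ASSUMED_BYTES_PER_INDEX_ENTRY = 32  # hypothetical value, not a MongoDB constant


def index_bytes(doc_count: int,
                bytes_per_entry: int = ASSUMED_BYTES_PER_INDEX_ENTRY) -> int:
    """Approximate size of one single-field index over doc_count documents."""
    return doc_count * bytes_per_entry


def hot_index_bytes(active_sources: int) -> int:
    """Index memory needed when only some per-source collections are queried."""
    docs_per_source = TOTAL_DOCS // NUM_SOURCES
    return active_sources * index_bytes(docs_per_source)


if __name__ == "__main__":
    gib = 1024 ** 3
    whole = index_bytes(TOTAL_DOCS)
    hot = hot_index_bytes(active_sources=20)  # assumed working set of 20 sources
    print(f"one index over all documents: {whole / gib:.2f} GiB")
    print(f"indexes of 20 active sources: {hot / gib:.2f} GiB")
```

The total index size is the same in both layouts; the difference is only in how much of it has to be resident when the working set covers a few sources.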

However, MongoDB's sharding is meant to solve exactly that problem (maintaining huge amounts of data), and it does so by keeping everything in a single logical collection while distributing that collection automatically over as many shards as you like. This is far more flexible than creating those individual collections yourself. Among other things, it allows data to be rebalanced over time so that each shard holds a roughly equal portion of the data. It also adapts more easily to a different number of shards, while your multi-collection scheme seems to rely on a rather fixed partitioning of the data (by source number).

With sharding, the application would be completely unaware of the distribution patterns, and you could add or remove as many shards as you want, transparently, to handle the volume of your data.
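The question notes that the _sourceX suffix can be turned into an extra key or vice versa. A minimal sketch of that conversion in plain Python follows, with no database driver involved. The field name `source` and both function names are hypothetical; only the `logs` and `logs_sourceN` collection names come from the question.

```python
from typing import Any, Dict, Tuple

PREFIX = "logs_source"  # collection naming from the question
SOURCE_KEY = "source"   # hypothetical name for the extra key


def to_single_collection(collection_name: str,
                         doc: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Move the source number from the collection name into the document."""
    if not collection_name.startswith(PREFIX):
        raise ValueError(f"unexpected collection name: {collection_name!r}")
    source_id = int(collection_name[len(PREFIX):])
    return "logs", {**doc, SOURCE_KEY: source_id}


def to_per_source_collection(doc: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Inverse mapping: derive the collection name from the extra key."""
    rest = {k: v for k, v in doc.items() if k != SOURCE_KEY}
    return f"{PREFIX}{doc[SOURCE_KEY]}", rest


if __name__ == "__main__":
    name, doc = to_single_collection("logs_source7", {"level": "info", "msg": "started"})
    print(name, doc)
    print(to_per_source_collection(doc))
```

With the source stored as an ordinary key, the application reads and writes a single `logs` collection, and how that collection is spread over shards stays a deployment concern rather than something encoded in collection names.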

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow