Question

I am just confuse about the Sharding and Replication that how they works..According to Definition

Replication: A replica set in MongoDB is a group of mongod processes that maintain the same data set.

Sharding: Sharding is a method for storing data across multiple machines.

As per my understanding if there is data of 75 GB then by replication (3 servers), it will store 75GB data on each servers means 75GB on Server-1, 75GB on server-2 and 75GB on server-3..(correct me if i am wrong)..and by sharding it will be stored as 25GB data on server-1, 25Gb data on server-2 and 25GB data on server-3.(Right?)...but then i encountered this line in the tutorial

Shards store the data. To provide high availability and data consistency, in a production sharded cluster, each shard is a replica set

As replica set is of 75GB but shard is of 25GB then how they can be equivalent...this makes me confuse a lot...I think i am missing something great in this. Please help me in this.

Was it helpful?

Solution

Lets try with this analogy. You are running the library.

As any person who has is running a library you have books in the library. You store all the books you have on the shelf. This is good, but your library became so good that your rival wants to burn it. So you decide to make many additional shelves in other places. There is one the most important shelf and whenever you add some new books you quickly add the same books to other shelves. Now if the rival destroys a shelf - this is not a problem, you just open another one and copy it with the books.

This is replication (just substitute library with application, shelf with a server, book with a document in the collection and your rival is just failed HDD on the server). It just makes additional copies of the data and if something goes wrong it automatically selects another primary.

This concept may help if you

  • want to scale reads (but they might lag behind the primary).
  • do some offline reads which do not touch main server
  • serve some part of the data for a specific region from a server from that specific region
  • But the main reason behind replication is data availability. So here you are right: if you have 75Gb of data and replicate it with 2 secondaries - you will get 75*3 Gb of data.

Look at another scenario. There is no rival so you do not want to make copy of your shelves. But right now you have another problem. You became so good that one shelf is not enough. You decide to distribute your books between many shelves. You decide to distribute them between shelves based on the author name (this is not be a good idea and read how to select sharding key here). So everything that starts with name less then K goes to one shelf everything that is K and more goes to another. This is sharding.

This concept may help you:

  • distribute a workload
  • be able to save data which much more then can fit on a single server
  • do map-reduce things
  • store more data in ram for faster queries

Here you are partially correct. If you have 75Gb, then in sum on all the servers there will be still 75 Gb, but it does not necessarily be divided equally.

But here is a problem with only sharding. Right now your rival appeared and he just came to one of your shelves and burned it. All the data on that shelf is lost. So you want to replicate every shard as well. Basically the notion that

each shard is a replica set

is not true. But if you are doing sharding you have to create a replication for every shard. Because the more shards you have, the bigger is the probability that at least one will die.

OTHER TIPS

Answering Saad's followup answer:

Also you can have shards and replicas together on the same server, it is not recommended way of doing it. Each server should have a single role in the system. If for example you decide to have 2 shards and to replicate it 3 times, you will end up with 6 machines.

I know that this might sound too costly, but you have to remember that this is a commodity hardware and if the service you providing is already so good, that you think about high availability and does not fit one machine, then this is a rather cheap price to pay (in comparison to a dedicated one big machine).

I am writing it as an answer but actually its a question to @Salvador Sir's answer.

Like you said that in sharding 75 GB data "may be" stored as 25GB data on server-1, 25GB on server-2 and 25Gb on server-3. (this distribution depends on the Sharding Key)...then to prevent it from loss we also need to replicate the shard. so this means now every server contains it shards and also the replication of other shards present on other server..means Server-1 will have

1) Its own shard.

2) Replication of Shard present on server-2

3) Replication of Shard present on server-3

same goes with Server-2 and server-3. Am i right?..if this is the case then each server again have 75GB of data again. Right or wrong?

Since we want to make 3 shards and also replicate the data so following is the solution to the above problem.

r has shard and also replica set then in that case the failure of that server will lead to loss of replica set and shard.

However you can have the shard 1 and replica set (replica of shard 2 and shard 3) on same server but this is not advisable..

Sharding is like partition of data. Lets say you have around 3GB of data, and you defined 3 shards, So each shard MIGHT take 1GB of data(And it truly depends on the shard key) Why sharding is needed? Searching a specific data out of 3GB is 3 times complex than searching in 1GB of data. So its almost similar to partition. And sharding helps for fast accessing of data.

Now coming to Replica, Lets say you have the same 3GB of data without any replication(That means only a single copy of data exists) so if anything happens to that machine or the drive, your data is gone. So replication comes into picture to solve this problem, Lets say when you set up the DB, you have given your Replication as 3, which means the same 3GB of data is available 3 times(So the total size could be 9GB divided by each of 3GB copies). Replication helps for fail over.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top