Is consistent hashing required for sharding?

https://softwareengineering.stackexchange.com/questions/422079

22-03-2021
|

Question

I am reading about scaling of database and came to know about sharding technique. But I also read about consistent hashing technique. So how practically sharding is implemented? Do we arrange nodes in ring like consistent hashing and then assign servers to rings and then data to servers? Because as I see if my number of shards changes at run time, and if consistent hashing technique is not there, then it will mess up a lot of stuff. Can someone please throw some light on this?

Solution

Consider sharding as a form of distributed hash table, or distributed range table.

Now it depends on which one the above the sharding is doing.

For a distributed hash table, for each new piece of data added, it hashes the data and based on that hash directs the data to that machine/set of machines for persistence.

When a query comes in it is sent to all servers, and the results are combined, and depending on how complex the query is some post-processing may occur (because each machine must be over selective as query data may be split between two shards) before sending back to the user.

In this scheme you don't need perfect hashing, because every database is being asked. However you might want to optimise some queries like select * from X where id = '123'. If you don't have perfect hashing you must ask all shards the question, as while the hash might not point at them now, it might have when the data was added.

For a distributed range table, for each new piece of data added, it is sent to a machine/set of machine based on which part of the range it is in. For example 1-15 -> shard A 16-22 -> shard B. Shard might become unbalanced using this method and its not uncommon to have a background process splitting large shards into smaller shards and relocating a portion to a less utilised set of machines. In this sense a set of machines may be responsible for numerous sub sections of the range.

When a query comes in it is decomposed into simpler queries, and those simpler queries are directed to only those machines how could possibly have matching data. But instead of sending the results back to the co-ordinator, those machines might be directed to send results to each other so that then next subquery can be run on the proper shard. At the end of this the final queries data can be directly streamed back to the user.

In this scheme I would not call the partitioning function a Hash. For those properties that are part of partition, then it is possible to identify the exact shard/s that the value/s might be in, if it exists at all. But for those properties not in the partition all shards must be checked.

Of course real life databases are much more complicated than this. This is just a bed time story to help in the understanding of what's going on. A real system has to do this while also managing transactions, synchronisation, networks, etc.., and be fast.

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange