Cassandra num_tokens - is this really num_token_partitions?

Question 1

4) Partition ranges are determined by granting each node the range from their available tokens up until the next specified token.

2)Data is exchanged through gossip detailing which nodes have which tokens. This meta-data allows every node to know which nodes are responsible for which ranges. Keyspace/Replication settings also change where data is actually saved.

EXAMPLE: 1)A gets 256 ranges B gets 256 Ranges. But to make this simple lets give them each 2 tokens and pretend the token range is 0 to 30

Given tokens: A 10,15 and B 3,11 Nodes are responsible for the following ranges

(3-9:B)(10:A)(11-14:B)(15-30,0-2:A)

3)If C Joins also with 2 tokens 20,5 Nodes will now be responsible for the following ranges

(3-4:B)(5-9:C)(10:A)(11-14:B)(15-19:A)(20-30,0-2:C)

Vnodes are powerful because now when C joins the cluster it gets its data from multiple nodes (5-9 from B and 20-30,0-2 from A) sharing the load between those machines. In this toy example you can see that having only 2 tokens allows for some nodes to host the majority of the data while others get almost none. As the number of Vnodes increases the balance between the nodes increases as the ranges become randomly subdivided more and more. At 256 nodes you are extremely likely to have distributed an even amount of data to each node in the cluster.

For more information VNodes: http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2

Question 2

Also RussS answer is correct, I think it's difficult to follow.

The idea is not so much the token allocation, because that's the technical mean used by Cassandra for the concept of distributing access to the data.

What's important are the replication factor and the ring to understand how this is meaningful.

The way the replication works is by copying the data of a node on the next two. So if you're on node A, the data assigned to A is replicated on B and C. The data assigned to B, is replicate on C and D, and so on.

If you have just 3 nodes and a replication of 3, it does not make any difference.

If you have 100 nodes, a replication of 3 and num_tokens: 1, then exactly 3 nodes replicate the data they are assigned and that's always the entire set of data of a node. In our example above, that means all the data A is assigned can be read from A, B, or C and only those three nodes. So if you are trying to load that specific data often and the rest not so often, your cluster is going to be rather unbalanced.

With v-nodes, the data is broken up in sub-partitions. One computer represents many virtual nodes. So old computer A may now represent A, D, G, J, M assuming a num_tokens: 5.

Next we have the ring. When building the ring, the computers will connect between each others in such a way that the same computer doesn't connect to itself (A won't talk to D directly and vice versa.)

Now, it means that one physical computer is going to be connected to num_tokens × replication_factor - 1 other computers. So with num_tokens set to 5 and a replication of 3, you are going to be connected to 10 other computers. This means the load is going to be shared between 10 computers instead of 3 (as the replication factor would otherwise imply.)

So with 16 nodes, a num_tokens: 256 and replication: 3, it would be a strange setup since it would imply that all the nodes are connected 512 times between each others. That being said, having to change the num_tokens later can take a little time for the cluster to adjust to the new value. Especially if you have a large installation. So if you foresee having a large number of nodes, a rather large num_tokens is a good idea from the start.

As a side effect, it will also distribute the data between various tables (files) on each node. That can also help finding data faster. It is actually suggested that you use a larger number of instances (16 to 64) whenever you create an Elassandra cluster to ease the search.

Question 3

At 256 nodes you are extremely likely to have distributed an even amount of data to each node in the cluster.

Unless of course it's not. Random Vnode token range allocation has nothing to do with balanced load. Balanced load is token range ENGINEERED to be balanced, not guessed.

Then there are the bugs in token range allocation CASSANDRA-6388 and CASSANDRA-7032 neither one fixed in any cluster running in production today. Then there are the major problems with 256 VNODE clusters and trying to rebuild them or back them up which is impossible, literally.

Rebuilds and recoveries take WEEKS. And just try running hadoop against vnodes in production. Give up an engineered token range cluster for VNODE hail mary's at your peril.