Question

This is my configuration for 4 Data Centers of Cassandra:

create KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 1, 'DC2' : 1, 'DC3' : 1, 'DC4' : 1};

In this configuration (Murmur3Partitioner + 256 tokens), each DC is storing roughly 25% of the key space, and that 25% is replicated to each of the 3 other DCs, meaning that every single row has 4 copies overall.

For instance, if my database is too big to keep 4 complete copies of it, how can I configure Cassandra so that each DC's data is replicated only once or twice (instead of to the total number of other DCs (x3))?

For example: the 25% of the key space that is stored on DC1 I want to replicate once, on DC2 only. I am not looking to select any particular DC for replication, nor do I care if the 25% from DC1 ends up split over DCs 2, 3 and 4. I just want to use NetworkTopologyStrategy but reduce storage costs.

Is it possible ?

Thank you, best regards

Solution

Your keyspace command shows that each of the DCs holds 1 copy of the data. This means that if you have 1 node in each DC, each node will have 100% of your data. So I am not sure how you concluded that each of your DCs stores only 25% of the keys, when it is clear they are storing 100%. Chances are that when you run the nodetool command you are not specifying the keyspace, so it shows you a load figure based only on the token ranges assigned to each node, which is misleading for a NetworkTopologyStrategy setup. Try running it with your keyspace name and see if you notice the difference.
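For example, assuming the keyspace name mySpace from your command, comparing these two outputs should show the difference:

nodetool status
nodetool status mySpace

The first reports load purely by token range; the second adds the effective ownership per node under that keyspace's replication settings.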

I don't think there is a way to shift data around DCs the way you want using any of the existing snitches. If you really wanted an even distribution, and you had an equal number of nodes in each DC with initial tokens spaced evenly, you could use SimpleSnitch to achieve what you want. You can change the snitch to SimpleSnitch and run nodetool cleanup/repair on each node. Bear in mind that during this process you will have some outage, because after the snitch change previously written keys may not be available on some nodes until the repair job is done.
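A rough sketch of that procedure, assuming the standard cassandra.yaml and nodetool setup, would be:

# on every node, in cassandra.yaml, then restart the node
endpoint_snitch: SimpleSnitch

# once all nodes are back up, run on each node
nodetool repair
nodetool cleanup

The repair pulls in the keys each node now owns under the new snitch, and the cleanup drops the data it no longer owns.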

The way NetworkTopologyStrategy works is that if you specify DC1:1 and you have, for example, 2 nodes in DC1, it will evenly distribute keys across those 2 nodes, leading to an effective load of 50% on each node. With that in mind, I think what you really want is to keep 3 copies of your data, 1 in each DC, so you can discard one DC entirely and save money. I am saying this because I suspect your DCs are virtual in the sense of your NetworkTopology configuration and not real physical data centers, since nobody would want to keep only 25% of the data in a single DC: it would not be an available setup. So, if your nodes are only grouped into virtual DCs, I recommend you group them into 3 racks instead and maintain 1 DC:

DC1:
nd1-ra_1 rack-a
nd1-rb_1 rack-b
nd1-rc_1 rack-c

nd2-ra_2 rack-a
nd2-rb_2 rack-b
nd2-rc_2 rack-c

nd3-ra_3 rack-a
nd3-rb_3 rack-b
nd3-rc_3 rack-c

nd4-ra_4 rack-a
nd4-rb_4 rack-b
nd4-rc_4 rack-c

In this case, if you set your replication option to DC1:3, each of the racks a, b, and c will have 100% of your data (each node in each rack 25%).
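If you go that route with, say, GossipingPropertyFileSnitch (an assumption on my part, since you did not mention which snitch you use), the grouping is just a per-node cassandra-rackdc.properties plus a keyspace change, roughly:

# cassandra-rackdc.properties on nd1-ra_1 (each node gets its own rack value)
dc=DC1
rack=rack-a

ALTER KEYSPACE mySpace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1' : 3};

With 3 racks and a replication factor of 3, NetworkTopologyStrategy places one replica per rack, which is what gives every rack a full copy and every node roughly 25%. (Changing the rack of an already-running node needs the same repair/cleanup care mentioned above.)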
