Question

Will regularly running nodetool repair on my Cassandra nodes cripple them?

The Planet Cassandra FAQ notes (emphasis added) that

Anti-Entropy Node Repair – For data that is not read frequently, or to update data on a node that has been down for an extended period, the node repair process (also referred to as anti-entropy repair) ensures that all data on a replica is made consistent. Node repair (using the nodetool utility) should be run routinely as part of regular cluster maintenance operations.

That is the only reference I've seen to running nodetool repair regularly. Running it regularly won't be a problem if it is cheap, but just how expensive is it? Does it do the equivalent of a consistency-checked read of every record on the node? Or is it more clever than that? The documentation mentions the use of Merkle trees, but that does not give me any idea how expensive the operation is.

If you have 500 GB of data on a node, and that node is actually consistent with the other replicas (so the repair is a no-op), about how much data does the repair read from disk (reading all 500 GB would take a couple of hours)? And about how much data is sent over the LAN (sending all 500 GB over the LAN could take another hour or so)?


Solution

Some use cases are more dependent on regular repairs than others. If you perform deletes at less than ConsistencyLevel.ALL then you should run repair to ensure deleted columns don't come back to life (tombstones are discarded after gc_grace_seconds, so repair must run within that window or a replica that missed the delete can resurrect the data). If you don't do deletes, you can rely on hinted handoff and read repair to maintain consistency in many cases. If you read and write at low consistency levels, or regularly have server downtime or overloading, you will probably want to run repair.

What repair does is read through all the data on the node you run it on (optionally, with the -pr (primary range) option, only the ranges for which the node owns the primary range) and build up a Merkle tree. It also sends a message to all nodes that store replicas of any of these ranges to do the same - they will only read through the data that is replicated on the initial repair node.

To build a Merkle tree on a node with 500 GB, repair will read through the full 500 GB (when using -pr, the amount read is lower by roughly a factor of the replication factor). However, the Merkle trees are constant size (a few MB), so very little is sent over the network if the nodes are in sync.
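The shape of that trade-off can be sketched in a few lines of Python. This is not Cassandra's actual tree or wire format (the function names and hashing scheme here are illustrative), but it shows the key point: each replica hashes all of its data locally, and only the small tree of hashes crosses the network; matching root hashes mean nothing needs to be streamed.

```python
# Minimal Merkle-tree sketch: local hashing is expensive (reads every row),
# but comparing replicas only exchanges hashes, not data.
import hashlib

def merkle_tree(rows):
    """Build a Merkle tree (list of levels, leaves first) over (key, value) rows."""
    level = [hashlib.sha256(f"{k}:{v}".encode()).digest() for k, v in rows]
    tree = [level]
    while len(level) > 1:
        # Pair up hashes; duplicate the last one if the level has odd length.
        level = [hashlib.sha256(level[i] + (level[i + 1] if i + 1 < len(level) else level[i])).digest()
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree  # tree[-1][0] is the root hash

def out_of_sync(tree_a, tree_b):
    """Comparing root hashes is enough to detect a no-op repair."""
    return tree_a[-1][0] != tree_b[-1][0]

rows = [("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")]
a = merkle_tree(rows)
b = merkle_tree(rows)                      # replica fully in sync
c = merkle_tree(rows[:3] + [("k4", "X")])  # replica with one stale column
print(out_of_sync(a, b))  # False: roots match, nothing is streamed
print(out_of_sync(a, c))  # True: trees differ, only divergent ranges stream
```

When trees do differ, the comparison can descend level by level to narrow the mismatch down to a subrange, which is why only the out-of-sync portion of the data is streamed rather than the whole 500 GB.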

The best way to run scheduled repairs is to run with -pr on each node in turn. This avoids repairing the same data multiple times. Also, only run on one node at once to avoid placing extra load on your cluster.
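One way to express that schedule is with staggered cron entries, one per node, so that no two repairs overlap. This is a sketch, not a recommendation for specific times: the day-of-week staggering and 02:00 start time are assumptions you should adapt to your own off-peak window and repair duration.

```
# crontab on node1 - primary-range repair, early Monday
0 2 * * 1  nodetool repair -pr
# crontab on node2 - Tuesday
0 2 * * 2  nodetool repair -pr
# crontab on node3 - Wednesday
0 2 * * 3  nodetool repair -pr
```

Leave enough slack between entries for the previous node's repair to finish; if a repair can run longer than a day, an external scheduler that serializes the runs is safer than fixed cron slots.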

Licensed under: CC-BY-SA with attribution