Question
I am observing something rather strange with my test cluster of 6 nodes running Cassandra 2.0.3. I have about 2.5 TB of data (filesystem-wise) on each node.
--  Address      Load     Tokens  Owns   Host ID                               Rack
UN  10.5.45.160  1.43 TB  256     16.4%  24496067-455a-46fc-b846-d0be2a24bd36  RAC1
UN  10.5.45.156  1.4 TB   256     14.6%  4ff697a2-d501-4be7-ad05-82e37b2445c0  RAC1
UN  10.5.45.159  1.56 TB  256     17.5%  65a3e232-2d7a-44cf-8cc4-046a9a26d3f5  RAC1
UN  10.5.45.161  1.67 TB  256     16.4%  196f645e-d4e1-47ff-a7f5-da4d51cbd5c1  RAC1
UN  10.5.45.157  1.63 TB  256     17.3%  750b8c45-480e-42a7-8cbc-1d8671df5e56  RAC1
UN  10.5.45.158  1.53 TB  256     17.8%  985c8a08-3d92-4fad-a1d1-7135b2b9774a  RAC1
I was running some traffic tests on this cluster, but I stopped them 3 days ago: I was clearly overloading the cluster and wanted to let it calm down while I reviewed my test parameters. For the last week or more I had consistently seen about 4K pending compactions.

Now the strange part. It has been 3 days without any traffic at all, except for a few manual requests I made, yet all my nodes are still compacting endlessly. The number of pending compactions barely changes: sometimes it drops by 2-3, sometimes it increases by a similar amount, but it stays around 4300. I have an absolutely insane number of sstables, about 56K across the cluster according to the stats. All the tables that hold any real amount of data (in fact, only 4 tables have lots of data) use the leveled compaction strategy with sstable sizes configured between 160 and 360 MB. Compaction throughput is not throttled. There are 5 disks per node, and they are not the slowest ones. The disk load is real, I can see them all working hard. Yet there has been no progress on these compactions for 3 days; in fact, the disk usage barely changes.
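For the record, the pending figure above is the same one nodetool compactionstats reports; to poll it per node I use a small JMX client along these lines. This is only a sketch: the default port 7199, the absence of JMX auth, and the PendingTasks attribute on the CompactionManager MBean are assumptions based on the 2.0-era defaults.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PendingCompactions
{
    public static void main(String[] args) throws Exception
    {
        // Assumed defaults: JMX on port 7199, no authentication.
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // Same figure that `nodetool compactionstats` prints as pending tasks.
            ObjectName cm = new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            System.out.println(host + " pending compactions: "
                    + mbs.getAttribute(cm, "PendingTasks"));
        }
        finally
        {
            jmxc.close();
        }
    }
}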
I am almost sure that something is wrong with Cassandra or its settings and that it endlessly compacts the same data over and over again. Reads work fine; I saw that in most cases the data is loaded from only one sstable.
One thing to mention: I was suffering from the CASSANDRA-6008 issue and had to do some manual cleanup of the compactions in progress to be able to start the node(s).
I have just taken a look at one of these CFs and its sstables, and I am noticing something strange: on one node (the others seem to be in a more or less similar situation) I have about 5330 sstable files (...-Data.db). About 3900 of them are around 258 MB or so. The remaining ~1500 sstables are between a few hundred KB and 200 MB, most of them actually only a few MB.
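A quick throwaway scan along these lines produces those numbers. Sketch only; the data directory path is just an example, adjust it to your layout:

import java.io.File;
import java.util.Map;
import java.util.TreeMap;

public class SSTableHistogram
{
    public static void main(String[] args)
    {
        // Example path only; point it at the directory of the CF in question.
        File dir = new File(args.length > 0 ? args[0]
                : "/var/lib/cassandra/data/mykeyspace/mytable");
        File[] files = dir.listFiles();
        if (files == null)
        {
            System.err.println("not a directory: " + dir);
            return;
        }
        TreeMap<Long, Integer> buckets = new TreeMap<Long, Integer>(); // size bucket (MB) -> count
        int total = 0;
        for (File f : files)
        {
            if (!f.getName().endsWith("-Data.db"))
                continue; // the Data component is what dominates sstable size
            total++;
            long bucket = (f.length() / (1024 * 1024) / 32) * 32; // 32 MB buckets
            Integer c = buckets.get(bucket);
            buckets.put(bucket, c == null ? 1 : c + 1);
        }
        System.out.println(total + " sstables in " + dir);
        for (Map.Entry<Long, Integer> e : buckets.entrySet())
            System.out.printf("%4d-%4d MB: %d%n", e.getKey(), e.getKey() + 31, e.getValue());
    }
}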
cqlsh:mykeyspace> describe table mytable;
CREATE TABLE ... (
    ....
) WITH
    bloom_filter_fp_chance=0.100000 AND
    caching='KEYS_ONLY' AND
    dclocal_read_repair_chance=0.000000 AND
    gc_grace_seconds=864000 AND
    index_interval=128 AND
    read_repair_chance=0.100000 AND
    replicate_on_write='true' AND
    populate_io_cache_on_flush='false' AND
    default_time_to_live=0 AND
    speculative_retry='99.0PERCENTILE' AND
    memtable_flush_period_in_ms=0 AND
    compaction={'sstable_size_in_mb': '256', 'class': 'LeveledCompactionStrategy'} AND
    compression={'sstable_compression': 'SnappyCompressor'};
(Edited after some investigation.) Here is what seems to happen with the compactions: every compaction picks 32 files from L0. I think it falls into this condition from LeveledManifest.getCompactionCandidates():
if (generations[0].size() > MAX_COMPACTING_L0)
{
    ...
}
I have thousands of sstables at this level, so it falls into this condition, I believe.

Then it compacts these 32 sstables of about 256 MB each, and that creates exactly 32 new sstables of ~256 MB each. And so on, and so on.
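In other words, assuming the 32 inputs barely overlap (so the merge reclaims almost nothing), the arithmetic guarantees zero net progress in L0:

public class EndlessL0Compaction
{
    public static void main(String[] args)
    {
        // From the table settings above: sstable_size_in_mb = 256.
        // MAX_COMPACTING_L0 = 32 is the L0 candidate cap in LeveledManifest.
        final long sstableSizeMb = 256;
        final int maxCompactingL0 = 32;

        long inputMb = maxCompactingL0 * sstableSizeMb;  // 8192 MB go in
        // The compaction writer rolls over to a new sstable every 256 MB,
        // so with near-zero overlap between the inputs:
        long outputSstables = inputMb / sstableSizeMb;   // 32 come back out

        System.out.println("in:  " + maxCompactingL0 + " sstables");
        System.out.println("out: " + outputSstables + " sstables"); // L0 never shrinks
    }
}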
Solution
To keep anyone looking at SO in the loop, this was logged as a bug here: https://issues.apache.org/jira/browse/CASSANDRA-6496
A patch to fix the problem is attached there, and should end up in Cassandra 2.0.4.