Question

I am observing something rather strange with my test cluster of 6 nodes running Cassandra 2.0.3. I have about 2.5 TB of data (filesystem-wise) on each node.

--  Address      Load       Tokens  Owns   Host ID                               Rack
UN  10.5.45.160  1.43 TB    256     16.4%  24496067-455a-46fc-b846-d0be2a24bd36  RAC1
UN  10.5.45.156  1.4 TB     256     14.6%  4ff697a2-d501-4be7-ad05-82e37b2445c0  RAC1
UN  10.5.45.159  1.56 TB    256     17.5%  65a3e232-2d7a-44cf-8cc4-046a9a26d3f5  RAC1
UN  10.5.45.161  1.67 TB    256     16.4%  196f645e-d4e1-47ff-a7f5-da4d51cbd5c1  RAC1
UN  10.5.45.157  1.63 TB    256     17.3%  750b8c45-480e-42a7-8cbc-1d8671df5e56  RAC1
UN  10.5.45.158  1.53 TB    256     17.8%  985c8a08-3d92-4fad-a1d1-7135b2b9774a  RAC1

I was running some traffic tests on this cluster, but I stopped them 3 days ago. I was clearly overloading the cluster, and I wanted to let it calm down while I reviewed my test parameters. For the last week or more I have seen a roughly constant 4K pending compactions.

Now the strange part. It has been 3 days without any traffic at all, except a few manual requests I have made. Yet all my nodes are still compacting endlessly. The number of pending compactions barely changes: sometimes it drops by 2-3, sometimes it increases by a similar number, but it stays around 4300. I have an absolutely insane number of sstables - about 56K across the cluster, according to the stats. All the tables that hold any real amount of data (in fact, only 4 tables have lots of data) use the leveled compaction strategy with a configured sstable size of 160-360 MB. Compaction throughput is not throttled. There are 5 disks per node, and they are not the slowest ones. The disk load is real - I can see them all working hard. Yet there has been no progress on these compactions for 3 days; in fact, the disk usage barely changes.
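(For reference: the pending count I quote is the "pending tasks" figure from nodetool compactionstats. The same number can be polled over JMX; here is a minimal sketch, assuming JMX is reachable on Cassandra's default port 7199, using one of my node addresses.)

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Polls the pending-compactions count that nodetool compactionstats reports.
public class PendingCompactions {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://10.5.45.160:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            ObjectName cm = new ObjectName("org.apache.cassandra.db:type=CompactionManager");
            System.out.println("pending compactions: " + mbs.getAttribute(cm, "PendingTasks"));
        }
    }
}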

I am almost sure that something is wrong with Cassandra or its settings and that it endlessly compacts the same data over and over again. Reads are working fine; I saw that in most cases the data is served from only one sstable.

One thing to mention: I was suffering from the CASSANDRA-6008 issue and had to do some manual cleanup of the compactions in progress to be able to start the node(s).

I have just taken a look at one of these CFs and its sstables, and I am noticing something strange: on one node (the others seem to be in more or less the same situation) I have about 5330 sstable files (...-Data.db). About 3900 of them are around 258 MB or so. The remaining ~1500 sstables range from a few hundred KB to 200 MB, most of them actually being only a few MB.
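A quick back-of-the-envelope on those counts (my own arithmetic, nothing more):

// Rough arithmetic on the file counts above.
public class SstableMath {
    public static void main(String[] args) {
        long fullSizedMb = 3900L * 258;   // the ~max-size sstables
        System.out.printf("~%.2f TB%n", fullSizedMb / (1024.0 * 1024));
        // Prints ~0.96 TB: nearly all of this CF's data on the node sits in
        // sstables that are already at (or just over) the 256 MB target size.
    }
}

So roughly 1 TB of this node's ~1.5 TB load belongs to this single CF, in files of essentially one size.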

cqlsh:mykeyspace> describe table mytable;

CREATE TABLE ... (
 ....
) WITH
  bloom_filter_fp_chance=0.100000 AND
  caching='KEYS_ONLY' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'sstable_size_in_mb': '256', 'class': 'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'SnappyCompressor'};

(Edited after some investigation.) It looks like this is what happens with the compactions: every compaction picks 32 files from L0. I think this is the condition from LeveledManifest.getCompactionCandidates():

if (generations[0].size() > MAX_COMPACTING_L0)  // MAX_COMPACTING_L0 == 32
{
    ...

I have thousands of sstables at that level, so I believe it falls into this condition.

Then it compacts these 32 sstables of ~256 MB each, and that produces exactly 32 new sstables of ~256 MB each. And so on, and so on: no net progress is ever made.
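To illustrate the loop, here is a toy model of that behaviour (my own sketch, not Cassandra code; the 32, 256 and 3900 figures are taken from above):

// Toy model of the suspected loop: every round picks 32 L0 sstables that are
// already at the 256 MB target; merging already-max-size inputs produces the
// same number of max-size outputs, so the backlog never shrinks.
public class EndlessL0 {
    public static void main(String[] args) {
        final int MAX_COMPACTING_L0 = 32;   // from LeveledManifest
        final int SSTABLE_SIZE_MB = 256;    // sstable_size_in_mb in the schema above
        int l0 = 3900;                      // max-size files observed on this node

        for (int round = 1; round <= 3; round++) {
            int picked = Math.min(MAX_COMPACTING_L0, l0);
            long inputMb = (long) picked * SSTABLE_SIZE_MB;
            int produced = (int) Math.ceil(inputMb / (double) SSTABLE_SIZE_MB);
            l0 += produced - picked;        // net change: zero
            System.out.printf("round %d: %d in -> %d out, L0 still %d%n",
                              round, picked, produced, l0);
        }
    }
}

Each round rewrites about 8 GB (32 x 256 MB) without reducing the count, which would explain both the heavy disk activity and the flat pending-compactions number.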


Solution

To keep anyone looking at SO in the loop, this was logged as a bug here: https://issues.apache.org/jira/browse/CASSANDRA-6496

A patch to fix the problem is attached there, and should end up in Cassandra 2.0.4.

Licensed under: CC-BY-SA with attribution