Running Hadoop jobs over Snappy-compressed column families
08-07-2023
Question
I'm attempting to dump a Pig relation of a compressed column family. It's a single column whose value is a JSON blob. It's compressed via Snappy compression and the value validator is BytesType. When I create the relation and dump it, I get garbage. Here is the describe output:
ColumnFamily: CF
Key Validation Class: org.apache.cassandra.db.marshal.TimeUUIDType
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Cells sorted by: org.apache.cassandra.db.marshal.UTF8Type
GC grace seconds: 86400
Compaction min/max thresholds: 2/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Populate IO Cache on flush: false
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
Then I:
grunt> rows = LOAD 'cql://Keyspace/CF' using CqlStorage();
I've also tried:
grunt> rows = LOAD 'cql://Keyspace/CF' using CqlStorage() as (key: chararray, col1: chararray, value: chararray);
but when I dump this it still looks like it's binary.
Is compression not handled transparently, or am I just missing something? I've done some Googling but haven't seen anything on the subject. Also, I am using DataStax Enterprise 3.1. Thanks in advance!
Solution
I was able to solve the issue. There was another layer of compression happening in the DAO, which was using java.util.zip.Deflater/Inflater on top of the Snappy compression defined on the CF. The solution was to extend CassandraStorage and override the getNext() method. The new implementation calls super.getNext() and inflates the tuples where appropriate.
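A minimal sketch of what that override might look like, assuming the Cassandra 1.2-era org.apache.cassandra.hadoop.pig.CassandraStorage base class and that the deflated values arrive as top-level DataByteArray fields (CassandraStorage may nest column values inside bags of (name, value) tuples, in which case you would recurse into those). The class name InflatingCassandraStorage is made up for illustration:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

import org.apache.cassandra.hadoop.pig.CassandraStorage; // package name assumed (Cassandra 1.2 era)
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;

// CassandraStorage variant that undoes the DAO's extra Deflater layer
// after the normal load path has produced a tuple.
public class InflatingCassandraStorage extends CassandraStorage {

    @Override
    public Tuple getNext() throws IOException {
        Tuple t = super.getNext();
        if (t == null) {
            return null; // end of input
        }
        // Inflate any byte-array fields that turn out to be zlib data.
        // If your values sit inside nested bags/tuples, recurse into
        // those instead of only walking the top level.
        for (int i = 0; i < t.size(); i++) {
            Object field = t.get(i);
            if (field instanceof DataByteArray) {
                byte[] inflated = tryInflate(((DataByteArray) field).get());
                if (inflated != null) {
                    t.set(i, new DataByteArray(inflated));
                }
            }
        }
        return t;
    }

    // Returns the inflated bytes, or null if the input is not Deflater output.
    private static byte[] tryInflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream(compressed.length * 4);
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    break; // needs more input or a dictionary: not plain zlib data
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return null; // leave non-deflated fields untouched
        } finally {
            inflater.end();
        }
    }
}

Then register the jar and load with the custom class instead of CqlStorage; note that CassandraStorage uses the cassandra:// URI scheme rather than cql://, and the jar name here is hypothetical:

grunt> REGISTER inflating-storage.jar;
grunt> rows = LOAD 'cassandra://Keyspace/CF' USING InflatingCassandraStorage();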
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow