Running Hadoop jobs over Snappy-compressed column families
08-07-2023
Question
I'm attempting to dump a Pig relation of a compressed column family. It's a single column whose value is a JSON blob. It's compressed via Snappy compression and the value validator is BytesType. When I create the relation and dump it, I get garbage. Here is the describe output:
ColumnFamily: CF
Key Validation Class: org.apache.cassandra.db.marshal.TimeUUIDType
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Cells sorted by: org.apache.cassandra.db.marshal.UTF8Type
GC grace seconds: 86400
Compaction min/max thresholds: 2/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Populate IO Cache on flush: false
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
Then I:
grunt> rows = LOAD 'cql://Keyspace/CF' using CqlStorage();
I've also tried:
grunt> rows = LOAD 'cql://Keyspace/CF' using CqlStorage() as (key: chararray, col1: chararray, value: chararray);
but when I dump this it still looks like it's binary.
Is compression not handled transparently, or am I just missing something? I've done some Googling but haven't seen anything on the subject. Also, I am using DataStax Enterprise 3.1. Thanks in advance!
Solution
I was able to solve the issue. There was another layer of compression happening in the DAO, which was using java.util.zip.Deflater/Inflater on top of the Snappy compression defined on the CF. The solution was to extend CassandraStorage and override the getNext() method. The new implementation calls super.getNext() and inflates the tuples where appropriate.
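A minimal sketch of what that override might look like, assuming the Cassandra 1.2-era org.apache.cassandra.hadoop.pig.CassandraStorage base class and that the deflated values arrive as top-level DataByteArray fields (CassandraStorage may nest column values inside bags of (name, value) tuples, in which case you would recurse into those). The class name InflatingCassandraStorage is made up for illustration:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

import org.apache.cassandra.hadoop.pig.CassandraStorage; // package name assumed (Cassandra 1.2 era)
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;

// CassandraStorage variant that undoes the DAO's extra Deflater layer
// after the normal load path has produced a tuple.
public class InflatingCassandraStorage extends CassandraStorage {

    @Override
    public Tuple getNext() throws IOException {
        Tuple t = super.getNext();
        if (t == null) {
            return null; // end of input
        }
        // Inflate any byte-array fields that turn out to be zlib data.
        // If your values sit inside nested bags/tuples, recurse into
        // those instead of only walking the top level.
        for (int i = 0; i < t.size(); i++) {
            Object field = t.get(i);
            if (field instanceof DataByteArray) {
                byte[] inflated = tryInflate(((DataByteArray) field).get());
                if (inflated != null) {
                    t.set(i, new DataByteArray(inflated));
                }
            }
        }
        return t;
    }

    // Returns the inflated bytes, or null if the input is not Deflater output.
    private static byte[] tryInflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream(compressed.length * 4);
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0) {
                    break; // needs more input or a dictionary: not plain zlib data
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return null; // leave non-deflated fields untouched
        } finally {
            inflater.end();
        }
    }
}

Then register the jar and load with the custom class instead of CqlStorage; note that CassandraStorage uses the cassandra:// URI scheme rather than cql://, and the jar name here is hypothetical:

grunt> REGISTER inflating-storage.jar;
grunt> rows = LOAD 'cassandra://Keyspace/CF' USING InflatingCassandraStorage();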
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow