Specifying compression codec for an INSERT OVERWRITE SELECT in Hive

https://stackoverflow.com/questions/4831190

27-10-2019

Question

I have a hive table like

 CREATE TABLE beacons
 (
     foo string,
     bar string,
     foonotbar string
 )
 COMMENT "Digest of daily beacons, by day"
 PARTITIONED BY ( day string COMMENT "In YYYY-MM-DD format" );

To populate, I am doing something like:

 SET hive.exec.compress.output=True;
 SET io.seqfile.compression.type=BLOCK;

 INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
 SELECT
   someFunc(query, "foo") AS foo,
   someFunc(query, "bar") AS bar,
   otherFunc(query, "foo||bar") AS foonotbar
 FROM raw_logs
 WHERE day = "2011-01-26";

This builds a new partition whose output files are compressed with deflate, but ideally they would be compressed with the LZO codec instead.

Unfortunately I am not exactly sure how to accomplish that, but I assume it's one of the many runtime settings or perhaps just an additional line in the CREATE TABLE DDL.

Solution

Before the INSERT OVERWRITE, set the following runtime configuration values:

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
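
Put together with the statement from the question, the whole session might look like the sketch below. It assumes the hadoop-lzo library (which provides com.hadoop.compression.lzo.LzopCodec) is installed on the cluster; someFunc, otherFunc and raw_logs are the placeholder names from the question. As far as I know, newer Hadoop releases call the codec property mapreduce.output.fileoutputformat.compress.codec and keep the older name as a deprecated alias.

```sql
-- Turn on compression of the final job output.
SET hive.exec.compress.output=true;
-- Only relevant when the output is written as SequenceFiles.
SET io.seqfile.compression.type=BLOCK;
-- Assumption: hadoop-lzo is installed and this class is on the classpath.
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
SELECT
  someFunc(query, "foo") AS foo,
  someFunc(query, "bar") AS bar,
  otherFunc(query, "foo||bar") AS foonotbar
FROM raw_logs
WHERE day = "2011-01-26";
```

The SET values apply only to the current Hive session, so they need to be issued again (or placed in a script) each time the partition is rebuilt.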

Also make sure the desired compression codec is available by checking the value of:

io.compression.codecs
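
One way to do that check from inside Hive is to issue SET with the property name and no value, which prints the current setting instead of changing it. This is only a sketch; whether the LZO classes appear in the list depends on how the cluster was configured.

```sql
-- Prints the current value of the property; it does not modify anything.
SET io.compression.codecs;
```

If com.hadoop.compression.lzo.LzopCodec is not in the printed list, the codec probably still needs to be installed and registered in the Hadoop configuration.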

Further information about io.seqfile.compression.type can be found here: http://wiki.apache.org/hadoop/Hive/CompressedStorage

I may be mistaken, but it seemed that the BLOCK type would produce larger files compressed at a higher ratio, as opposed to a set of smaller, less compressed files.
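
As far as I know, io.seqfile.compression.type (NONE, RECORD or BLOCK) only affects output written as SequenceFiles, so it matters when the target table is declared STORED AS SEQUENCEFILE. A minimal sketch of that setup follows; the table name beacons_seq is hypothetical and not part of the original question.

```sql
-- Hypothetical SequenceFile-backed copy of the beacons table.
CREATE TABLE beacons_seq
(
    foo string,
    bar string,
    foonotbar string
)
PARTITIONED BY ( day string )
STORED AS SEQUENCEFILE;

SET hive.exec.compress.output=true;
-- BLOCK compresses batches of records together rather than one record at a time.
SET io.seqfile.compression.type=BLOCK;

INSERT OVERWRITE TABLE beacons_seq PARTITION ( day = "2011-01-26" )
SELECT foo, bar, foonotbar
FROM beacons
WHERE day = "2011-01-26";
```

Because BLOCK compresses many records at once, it usually gives a better ratio than RECORD, which is consistent with the impression described above.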

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow