Specifying compression codec for a INSERT OVERWRITE SELECT in Hive
-
27-10-2019 - |
Question
I have a hive table like
CREATE TABLE beacons
(
foo string,
bar string,
foonotbar string
)
COMMENT "Digest of daily beacons, by day"
PARTITIONED BY ( day string COMMENt "In YYYY-MM-DD format" );
To populate, I am doing something like:
SET hive.exec.compress.output=True;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" ) SELECT
someFunc(query, "foo") as foo,
someFunc(query, "bar") as bar,
otherFunc(query, "foo||bar") as foonotbar
)
FROM raw_logs
WHERE day = "2011-01-26";
This builds a new partition with the individual products compressed through deflate, but the ideal here would be to go through the LZO compression codec instead.
Unfortunately I am not exactly sure how to accomplish that, but I assume it's one of the many runtime settings or perhaps just an additional line in the CREATE TABLE DDL.
Solution
Before the INSERT OVERWRITE prepend with the following runtime configuration values:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec = com.hadoop.compression.lzo.LzopCodec;
Also make sure you have the desired compression codec by checking:
io.compression.codecs
Further information about io.seqfile.compression.type can be found here http://wiki.apache.org/hadoop/Hive/CompressedStorage
I maybe mistaken, but it seemed like BLOCK type would ensure larger files compressed at a higher ratio vs. a smaller set of lower compressed files.