Question

I have log files compressed with gzip. Currently I load the .gz files with Pig, parse the logs with REGEXP and store the result into an HBase table. The table has the option COMPRESSION => 'GZ'.
But I'm wondering if this is the best way. Obviously the .gz files are smaller, but how does Pig read the individual rows? Does it unzip the file? Wouldn't it be more efficient to unzip it before loading?

Can anyone give me a hint?

Regards, pawel


Solution

Pig is basically a compiler that turns your Pig script into a sequence of MapReduce jobs.
How does Pig handle a gzipped file?
The MapReduce job started by your Pig script looks at the file extension to see which compression codec was used and calls the matching decompressor.
Hadoop checks the installed codecs first and reports an error if it cannot find the one required for your file.
So yes, the file is unzipped, and the RecordReader then reads the uncompressed data, by default one record per line.
This is also why some MapReduce jobs fail with heap-size errors: the mapred child task cannot get enough memory to hold the uncompressed data.
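
For reference, here is a minimal Pig Latin sketch of such a pipeline; the path, regular expression, table name and column family are placeholders, not taken from the original post:

-- Pig/Hadoop pick the codec from the .gz extension, so the file can be loaded directly.
-- Note that gzip files are not splittable, so each file is decompressed and processed by a single mapper.
raw = LOAD '/logs/access.log.gz' USING TextLoader() AS (line:chararray);

-- Parse each line with a regular expression (the pattern here is only an example).
parsed = FOREACH raw GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line, '^(\\S+)\\s+(\\S+)\\s+(.*)$'))
    AS (host:chararray, ts:chararray, msg:chararray);

-- The first field becomes the HBase row key; the remaining fields map to the listed columns.
STORE parsed INTO 'hbase://logs'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:ts cf:msg');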
In HBase, compression is applied at the column-family level to reduce the space the table takes and to send less data over the wire during scans. Only three compression codecs are supported in HBase - gzip, LZO and Snappy - and each has a unique id that should not be changed.
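
For completeness, declaring that compression in the HBase shell looks roughly like this (the table and column-family names are placeholders):

# compression is declared per column family when the table is created
create 'logs', {NAME => 'cf', COMPRESSION => 'GZ'}

# or added to an existing column family later
disable 'logs'
alter 'logs', {NAME => 'cf', COMPRESSION => 'GZ'}
enable 'logs'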

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow