Question

We've realized a bit too late that archiving our files in GZip format for Hadoop processing isn't such a great idea. GZip isn't splittable, and the problems that causes have been covered at length elsewhere, so I won't repeat them here.

My question is: is BZip2 the best archival compression that will allow a single archive file to be processed in parallel by Hadoop? Gzip is definitely not, and from my reading LZO has some problems.

Solution

BZIP2 is splittable in Hadoop - it provides a very good compression ratio, but compression is very CPU-intensive, so in terms of CPU time and performance it is not optimal.

LZO is splittable in Hadoop - leveraging hadoop-lzo you get splittable compressed LZO files, but you need external .lzo.index files to be able to process them in parallel. The library provides the means of generating these indexes, either locally or in a distributed manner.

LZ4 is splittable in Hadoop - leveraging hadoop-4mc you get splittable compressed 4mc files. You don't need any external indexing, and you can generate archives with the provided command-line tool or from Java/C code, inside or outside Hadoop. 4mc makes LZ4 available on Hadoop at any point of the speed/compression-ratio trade-off: from a fast mode reaching 500 MB/s compression speed up to high/ultra modes providing a higher compression ratio, almost comparable with GZip's.
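For what it's worth, here is a minimal sketch of how you might check which codec Hadoop associates with a file and whether that codec is splittable on its own, using CompressionCodecFactory and the SplittableCompressionCodec marker interface. The file names are just placeholders, and the .lzo extension only resolves if the corresponding library is on the classpath and registered in io.compression.codecs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        // Placeholder file names; only the extension matters for codec lookup.
        for (String name : new String[] {"archive.bz2", "archive.gz", "archive.lzo"}) {
            CompressionCodec codec = factory.getCodec(new Path(name));
            if (codec == null) {
                System.out.println(name + ": no codec registered for this extension");
            } else if (codec instanceof SplittableCompressionCodec) {
                System.out.println(name + ": " + codec.getClass().getSimpleName() + " (splittable)");
            } else {
                // e.g. GzipCodec, or an LZO codec whose splitting lives in its input format
                System.out.println(name + ": " + codec.getClass().getSimpleName() + " (not splittable by itself)");
            }
        }
    }
}

As far as I can tell, BZip2Codec implements SplittableCompressionCodec directly, while LZO and 4mc handle splitting through their own input formats (the .lzo.index files and the built-in 4mc block index respectively), so they won't show up as splittable through this interface.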

OTHER TIPS

I don't consider the other answer correct. According to http://comphadoop.weebly.com/, bzip2 is splittable. LZO is too, if indexed.

So the answer is yes: if you want to use more mappers than you have files, then you'll want to use bzip2.

To do this, you could write a simple MR job that reads the data and just writes it out again; you then need to ensure you set mapred.output.compression.codec to org.apache.hadoop.io.compress.BZip2Codec.
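Roughly, such a pass-through job could look like the sketch below (paths are taken from the command line; this uses the newer mapreduce API, where setting FileOutputFormat.setOutputCompressorClass is equivalent to setting the property by hand):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecompressToBzip2 {

    // Identity-style mapper: pass every input line through unchanged.
    public static class PassThroughMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "recompress to bzip2");
        job.setJarByClass(RecompressToBzip2.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0);                        // map-only copy
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        FileOutputFormat.setCompressOutput(job, true);   // compress the output...
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class); // ...with bzip2
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}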

Here are five ways with gzip, three needing an index, two not.

It is possible to create an index for any gzip file, i.e. not specially constructed, as done by zran.c. Then you can start decompression at block boundaries. The index includes the 32K of uncompressed data history at each entry point.

If you are constructing the gzip file, then it can be made with periodic entry points whose index does not need uncompressed history at those entry points, making for a smaller index. This is done with the Z_FULL_FLUSH option to deflate() in zlib.
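For what it's worth, Java's java.util.zip.Deflater exposes the same flush modes, so this kind of stream can also be produced without dropping to C. Below is a minimal sketch, assuming you keep the list of compressed offsets yourself; note that for simplicity it writes a plain zlib stream, so a real gzip member would additionally need the gzip header and trailer (e.g. the nowrap Deflater constructor plus your own header and CRC32):

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

public class FullFlushEntryPoints {

    // Compress 'data', forcing a full flush every 'chunk' uncompressed bytes so that
    // decompression can restart at each recorded compressed offset with no history.
    public static List<Long> compress(byte[] data, int chunk, ByteArrayOutputStream out) {
        Deflater def = new Deflater();
        byte[] buf = new byte[64 * 1024];
        List<Long> entryPoints = new ArrayList<>();
        long written = 0;
        for (int pos = 0; pos < data.length; pos += chunk) {
            int len = Math.min(chunk, data.length - pos);
            boolean last = pos + len >= data.length;
            def.setInput(data, pos, len);
            if (last) {
                def.finish();
            }
            // FULL_FLUSH empties the compressor and resets its history, which is
            // what makes the next block independently decodable.
            int mode = last ? Deflater.NO_FLUSH : Deflater.FULL_FLUSH;
            int n;
            while ((n = def.deflate(buf, 0, buf.length, mode)) > 0) {
                out.write(buf, 0, n);
                written += n;
            }
            if (!last) {
                entryPoints.add(written); // compressed offset of the next entry point
            }
        }
        def.end();
        return entryPoints;
    }
}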

You could also do a Z_SYNC_FLUSH followed by a Z_FULL_FLUSH at each such point, which would insert two markers. Then you can search for the nine-byte pattern 00 00 ff ff 00 00 00 ff ff to find those. That's no different than searching for the six-byte marker in bzip2 files, except that a false positive is much less likely with nine bytes. Then you don't need a separate index file.
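The search itself is just a byte scan; a small sketch of what that might look like, using the pattern bytes given above:

import java.util.ArrayList;
import java.util.List;

public class FlushMarkerScan {

    // Nine-byte marker left by a Z_SYNC_FLUSH immediately followed by a Z_FULL_FLUSH.
    private static final int[] MARKER = {0x00, 0x00, 0xff, 0xff, 0x00, 0x00, 0x00, 0xff, 0xff};

    // Returns the offsets just past each marker, i.e. the candidate entry points
    // at which an independent deflate block begins.
    public static List<Long> findEntryPoints(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        for (int i = 0; i + MARKER.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if ((data[i + j] & 0xff) != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                offsets.add((long) (i + MARKER.length));
            }
        }
        return offsets;
    }
}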

Both gzip and xz support simple concatenation. This allows you to easily prepare an archive for parallel decompression in another way. In short:

gzip < a > a.gz
gzip < b > b.gz
cat a.gz b.gz > c.gz
gunzip < c.gz > c
cat a b | cmp - c

will result in the compare succeeding.

You can then simply compress in chunks of the desired size and concatenate the results. Save an index of the offsets of the start of each gzip stream. Decompress from those offsets. You can pick the size of the chunks to your liking, depending on your application. If you make them too small, however, compression will be impacted.
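In Java, that chunk-and-concatenate step might look like the sketch below, with one GZIPOutputStream per chunk written to the same underlying stream and the start offset of each member recorded as the index (the chunk size is whatever suits your application):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class ChunkedGzip {

    // Compress 'data' as independent gzip members of at most 'chunkSize' uncompressed
    // bytes each, concatenated into 'out'. The returned offsets (start of each member)
    // are the index you would save for parallel decompression.
    public static List<Long> compressChunks(byte[] data, int chunkSize, ByteArrayOutputStream out)
            throws IOException {
        List<Long> memberOffsets = new ArrayList<>();
        for (int pos = 0; pos < data.length; pos += chunkSize) {
            memberOffsets.add((long) out.size());        // this member starts here
            int len = Math.min(chunkSize, data.length - pos);
            GZIPOutputStream gz = new GZIPOutputStream(out);
            gz.write(data, pos, len);
            gz.finish();                                  // complete the member, keep 'out' open
        }
        return memberOffsets;
    }
}

Reading the whole thing back sequentially still works, since GZIPInputStream in recent JDKs accepts concatenated members.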

With simple concatenation of gzip files, you could also forgo the index if you made each chunk a fixed uncompressed size. Then each chunk ends with the same four bytes, the uncompressed length in little-endian order, e.g. 00 00 10 00 for 1 MiB chunks, followed by 1f 8b 08 from the next chunk, which is the start of a gzip header. That seven-byte marker can then be searched for just like the bzip2 marker, though again with a smaller probability of false positives.

The same could be done with concatenated xz files, whose header is the seven bytes: fd 37 7a 58 5a 00 00.

My 2 cents: bzip2 is very slow for writing. Tested with Apache Spark 1.6.2 and Hadoop 2.7, compressing a simple 50 GB JSON file takes twice as long with bzip2 as with gzip.

But with bzip2, 50 GB ==> 4 GB!
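For reference, a comparison along those lines could be run with something like the sketch below on a current Spark version (the paths are placeholders, and the DataFrame writer API differed a bit back in Spark 1.6):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GzipVsBzip2 {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("gzip vs bzip2").getOrCreate();

        // Placeholder input path; point this at your own JSON data set.
        Dataset<Row> df = spark.read().json("hdfs:///data/input.json");

        // Write the same data twice, once per codec, and compare size and wall-clock time.
        df.write().option("compression", "gzip").json("hdfs:///data/out_gzip");
        df.write().option("compression", "bzip2").json("hdfs:///data/out_bzip2");

        spark.stop();
    }
}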

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow