Question

I heard that Hadoop can use multiple mappers to read different parts of a single bzip2 file in parallel, to increase performance. But I could not find any related examples after searching. I would appreciate it if anyone could point me to a relevant code snippet. Thanks.

BTW: does gzip have the same feature (multiple mappers processing different parts of one gzip file in parallel)?

Solution

If you look at http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/30662, you will find that the bzip2 format is indeed splittable, so multiple mappers can work on one file. The patch was submitted at https://issues.apache.org/jira/browse/HADOOP-4012; however, it appears to be available only in Hadoop 0.21.0 and later.

From personal experience, there is nothing special you need to do to use this with bzip2: Hadoop should pick it up automatically, and the number of splits will depend on your minimum split size. A minimal job driver is sketched below.
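For illustration, here is a minimal job driver sketch, assuming a Hadoop release with splittable bzip2 support (0.21.0+; the method names below are from the newer org.apache.hadoop.mapreduce API, e.g. Job.getInstance is Hadoop 2.x). The input path and the split-size cap are placeholders; capping the split size simply forces a large .bz2 file to be carved into several splits, one per map task.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Bzip2SplitJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "bzip2 split demo");
            job.setJarByClass(Bzip2SplitJob.class);

            // Identity mapper; the point is only that several map tasks
            // each receive a different split of the same .bz2 file.
            job.setMapperClass(Mapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            job.setInputFormatClass(TextInputFormat.class);
            FileInputFormat.addInputPath(job, new Path("/input/big-file.bz2"));
            // Cap split size so a large compressed file yields multiple splits.
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // 64 MB
            FileOutputFormat.setOutputPath(job, new Path("/output/bzip2-split-demo"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Running this against a .bz2 input larger than the cap should launch several map tasks, each decompressing its own range of blocks.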

bzip2 compresses data in independent blocks, so it is possible to decompress the blocks separately and send each one to its own mapper. gzip has no such block structure, and therefore a gzip file cannot be divided among different mappers.
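If you want to verify how your Hadoop version classifies a given file, a small sketch like the following can help, assuming Hadoop 0.21.0+ on the classpath (where the SplittableCompressionCodec interface exists and BZip2Codec implements it; the file names are just examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittableCheck {
        public static void main(String[] args) {
            CompressionCodecFactory factory =
                new CompressionCodecFactory(new Configuration());
            for (String name : new String[] {"data.bz2", "data.gz"}) {
                CompressionCodec codec = factory.getCodec(new Path(name));
                // BZip2Codec implements SplittableCompressionCodec;
                // GzipCodec does not, so gzip files are read by one mapper.
                boolean splittable = codec instanceof SplittableCompressionCodec;
                System.out.println(name + " -> "
                    + (codec == null ? "no codec" : codec.getClass().getSimpleName())
                    + ", splittable: " + splittable);
            }
        }
    }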

OTHER TIPS

You can look at pbzip2 for an example of parallel bz2 compression and decompression.

There is a parallel gzip as well, pigz. It does parallel compression, but not parallel decompression, since the deflate format is not suited to parallel decompression. However, you can either (a) prepare a special gzip stream with resets of the history, or (b) build an index into the gzip file on a first pass. Either way, you can then read different parts in parallel, or get more efficient random access. A sketch of approach (a) follows.
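One way to realize approach (a): a gzip file may consist of several concatenated members, and each new member starts with a fresh history, so recording each member's byte offset lets you decompress any member on its own. This is a sketch using plain java.util.zip, not pigz's actual mechanism; the file name and chunking are illustrative, and readAllBytes needs Java 9+.

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import java.util.*;
    import java.util.zip.*;

    public class ChunkedGzip {
        public static void main(String[] args) throws Exception {
            File out = new File("chunked.gz");
            List<Long> offsets = new ArrayList<>();

            // Compress each chunk as its own gzip member, appended to one file.
            try (FileOutputStream fos = new FileOutputStream(out)) {
                for (String chunk : new String[] {"first block\n", "second block\n"}) {
                    offsets.add(fos.getChannel().position()); // member start
                    GZIPOutputStream gz = new GZIPOutputStream(fos);
                    gz.write(chunk.getBytes(StandardCharsets.UTF_8));
                    gz.finish(); // end this member but keep the file open
                }
            }
            offsets.add(out.length()); // sentinel: end of last member

            // Decompress only the second member, without touching the first.
            int i = 1;
            byte[] member = new byte[(int) (offsets.get(i + 1) - offsets.get(i))];
            try (RandomAccessFile raf = new RandomAccessFile(out, "r")) {
                raf.seek(offsets.get(i));
                raf.readFully(member);
            }
            try (GZIPInputStream gzin =
                     new GZIPInputStream(new ByteArrayInputStream(member))) {
                byte[] buf = gzin.readAllBytes();
                System.out.print(new String(buf, StandardCharsets.UTF_8));
            }
        }
    }

The result is still a valid gzip file: running gunzip on chunked.gz yields all chunks concatenated in order, while readers that know the offsets can decompress members independently.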

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow