Question

We have a question specifically regarding compressed input to an Amazon EMR Hadoop job.

According to AWS:

"Hadoop checks the file extension to detect compressed files. The compression types supported by Hadoop are: gzip, bzip2, and LZO. You do not need to take any additional action to extract files using these types of compression; Hadoop handles it for you."

q.v., http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HowtoProcessGzippedFiles.html
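
For reference, this extension-based detection can be checked directly with Hadoop's CompressionCodecFactory. Below is a minimal sketch (assuming hadoop-common on the classpath; the class and file names are just illustrative) that prints which codec Hadoop would pick for a given file name:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecCheck {
        public static void main(String[] args) {
            // CompressionCodecFactory maps file extensions to codecs
            // (.gz -> GzipCodec, .bz2 -> BZip2Codec, and so on).
            CompressionCodecFactory factory =
                new CompressionCodecFactory(new Configuration());
            for (String name : new String[] {"logs.txt.gz", "logs.txt.bz2", "logs.txt"}) {
                CompressionCodec codec = factory.getCodec(new Path(name));
                System.out.println(name + " -> "
                    + (codec == null ? "no codec (treated as plain text)"
                                     : codec.getClass().getSimpleName()));
            }
        }
    }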

Which seems good; however, looking into the BZip2 format, it appears that the "split" boundaries would be file-based:

   .magic:16                       = 'BZ' signature/magic number
   .version:8                      = 'h' for Bzip2 ('H'uffman coding), '0' for Bzip1 (deprecated)
   .hundred_k_blocksize:8          = '1'..'9' block-size 100 kB-900 kB (uncompressed)
-->.compressed_magic:48             = 0x314159265359 (BCD (pi))
   .crc:32                         = checksum for this block
   .randomised:1                   = 0=>normal, 1=>randomised (deprecated)
   .origPtr:24                     = starting pointer into BWT for after untransform
   .huffman_used_map:16            = bitmap, of ranges of 16 bytes, present/not present
   .huffman_used_bitmaps:0..256    = bitmap, of symbols used, present/not present (multiples of 16)
   .huffman_groups:3               = 2..6 number of different Huffman tables in use
   .selectors_used:15              = number of times that the Huffman tables are swapped (each 50 bytes)
   *.selector_list:1..6            = zero-terminated bit runs (0..62) of MTF'ed Huffman table (*selectors_used)
   .start_huffman_length:5         = 0..20 starting bit length for Huffman deltas
   *.delta_bit_length:1..40        = 0=>next symbol; 1=>alter length 
   .contents:2..∞                  = Huffman encoded data stream until end of block
-->.eos_magic:48                    = 0x177245385090 (BCD sqrt(pi))
   .crc:32                         = checksum for whole stream
   .padding:0..7                   = align to whole byte

With the statement: "Like gzip, bzip2 is only a data compressor. It is not an archiver like tar or ZIP; the program itself has no facilities for multiple files, encryption or archive-splitting, but, in the UNIX tradition, relies instead on separate external utilities such as tar and GnuPG for these tasks."

q.v., http://en.wikipedia.org/wiki/Bzip2

I interpret the combination of these two statements to mean that BZip2 is "split-able", but only on a by-file basis.

This is relevant because our job will receive a single ~800MiB file via S3, which (if my interpretation is correct) would mean that EC2/Hadoop would assign ONE Mapper to the job (for ONE file), which would be sub-optimal, to say the least.

(That being the case, the solution would obviously be to find a way to partition the input into a set of 400 files before BZip2 is applied.)
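
(If we do go that route, a pre-processing step along these lines could do the partitioning. This is only a sketch using Apache Commons Compress's BZip2CompressorOutputStream; the class and method names are ours, and it assumes newline-delimited records:)

    import java.io.*;
    import java.nio.charset.StandardCharsets;
    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

    public class SplitThenCompress {
        // Break the input into parts of roughly partSize bytes, splitting only
        // at newlines so that no record straddles two parts, then bzip2 each part.
        public static void split(File input, long partSize) throws IOException {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new FileInputStream(input), StandardCharsets.UTF_8))) {
                int part = 0;
                long written = 0;
                OutputStream out = openPart(input, part);
                String line;
                while ((line = reader.readLine()) != null) {
                    byte[] bytes = (line + "\n").getBytes(StandardCharsets.UTF_8);
                    if (written > 0 && written + bytes.length > partSize) {
                        out.close();                  // finish the current .bz2 part
                        out = openPart(input, ++part);
                        written = 0;
                    }
                    out.write(bytes);
                    written += bytes.length;
                }
                out.close();
            }
        }

        private static OutputStream openPart(File input, int part) throws IOException {
            return new BZip2CompressorOutputStream(new BufferedOutputStream(
                new FileOutputStream(input.getName() + ".part" + part + ".bz2")));
        }
    }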

Does anyone know for certain if this is how AWS/EMR Hadoop jobs internally function?

Cheers!


Solution

Being splittable on file boundaries doesn't really mean anything since a .bz2 file doesn't have any concept of files.

A .bz2 stream consists of a 4-byte header, followed by zero or more compressed blocks.

Compressed blocks are the key here. A .bz2 file can be split on block boundaries. So the number of splits you can create will depend on the size of a compressed block.
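
To make that concrete, a splitter has to locate block boundaries by scanning for the 48-bit block magic 0x314159265359. The catch is that bzip2 block headers are bit-aligned, not byte-aligned, so every bit offset has to be tested. A rough sketch (the class and method names are mine, and a real reader would also have to tolerate the pattern occurring by chance inside compressed data):

    import java.io.IOException;
    import java.io.InputStream;

    public class BZip2BlockScanner {
        private static final long BLOCK_MAGIC = 0x314159265359L; // 48-bit block header
        private static final long MASK_48 = (1L << 48) - 1;

        /** Returns the bit offset where the next block magic starts, or -1 at EOF. */
        public static long findNextBlock(InputStream in) throws IOException {
            long window = 0;  // rolling window of the 48 most recent bits
            long bitPos = 0;  // total bits consumed so far
            int b;
            while ((b = in.read()) != -1) {
                for (int i = 7; i >= 0; i--) {
                    window = ((window << 1) | ((b >> i) & 1)) & MASK_48;
                    bitPos++;
                    // Block headers are NOT byte-aligned, so test after every bit.
                    if (bitPos >= 48 && window == BLOCK_MAGIC) {
                        return bitPos - 48;
                    }
                }
            }
            return -1;
        }
    }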

Edit (based on your comment):

A split boundary in Hadoop can often occur halfway through a record, whether or not the data is compressed. TextInputFormat splits on HDFS block boundaries. The trick is in the RecordReader.

Let's say we have a split boundary in the middle of the 10th record in a file. The mapper that reads the first split will read up to the end of the 10th record, even though that record ends outside of the mapper's allotted split. The second mapper then ignores the first partial record in its split, since it has already been read by the first mapper.

This only works if you can reliably find the end of a record when given an arbitrary byte offset into it.
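
Here is a simplified, self-contained sketch of that trick for newline-delimited records. It is not Hadoop's actual LineRecordReader, but it follows the same logic:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class SplitLineReader {
        /** Reads every record whose first byte falls in [start, end). */
        public static void readSplit(String file, long start, long end) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                raf.seek(start);
                long pos = start;
                // Unless we are at the very start of the file, the first (partial)
                // line belongs to the previous split's reader: skip past it.
                if (start != 0) {
                    raf.readLine();               // discard the partial record
                    pos = raf.getFilePointer();
                }
                // Emit records while the record START is inside our split; the
                // last record may run past 'end', and that is deliberate.
                while (pos < end) {
                    String line = raf.readLine();
                    if (line == null) break;      // end of file
                    pos = raf.getFilePointer();
                    System.out.println("record: " + line);
                }
            }
        }
    }

A mapper assigned the split [start, end) would call readSplit(file, start, end); across all the mappers, every record is read exactly once.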

OTHER TIPS

From what I understand so far, bzip2 files are potentially splittable, and you can do it in Hadoop, but it is still not specifically supported in AWS EMR, at least not "right away". So if you just run a job with a large bzip2 file, you are going to get a single mapper in the first step; I just recently tried it. Apparently the same happens with indexed LZO files, unless you do some black magic, and I am not sure there is a corresponding black magic for splitting bzip2 files in EMR.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow