When using LZO on Hadoop output on AWS EMR, does it index the files (stored on S3) for future automatic splitting?

StackOverflow https://stackoverflow.com/questions/13019996

Question

I want to use LZO compression on my Elastic MapReduce job's output that is being stored on S3, but it is not clear whether the files are automatically indexed so that future jobs run against this data will split the files into multiple tasks.

For example, if my output is a bunch of lines of TSV data in a 1GB LZO file, will a future map job create only 1 task, or something like (1GB / blockSize) tasks (i.e. the behavior when the files are not compressed, or when an LZO index file is present in the directory)?

Edit: If this is not done automatically, what is the recommended way to get my output LZO-indexed? Should I do the indexing before uploading the files to S3?


Solution

Short answer to my first question: AWS does not do automatic indexing. I've confirmed this with my own job, and also read the same from Andrew@AWS on their forum.

Here's how you can do the indexing:

To index the LZO files, you'll need to use your own Jar built from the Twitter hadoop-lzo project. You'll need to build the Jar somewhere and then upload it to Amazon S3 if you want to run the indexing directly with EMR.

On a side note, Cloudera has good instructions on all the steps for setting this up on your own cluster. I did this on my local cluster, which let me build the Jar and upload it to S3. You can probably find a pre-built Jar on the net if you don't want to build it yourself.

When outputting your data from your Hadoop job, make sure you use the LzopCodec and not the LzoCodec, otherwise the files are not indexable (at least in my experience). Example Java code (the same idea carries over to the Streaming API):

import com.hadoop.compression.lzo.LzopCodec;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// LzopCodec writes indexable .lzo files; LzoCodec's .lzo_deflate output can't be indexed
TextOutputFormat.setCompressOutput(job, true);
TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

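For context, here is a minimal, self-contained driver sketch. The class name, the map-only pass-through structure, and the paths are my own illustrative assumptions (not anything EMR requires), and Job.getInstance assumes Hadoop 2.x:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import com.hadoop.compression.lzo.LzopCodec;

public class LzoOutputJob {

  // Pass-through mapper: emits each TSV line unchanged with no key,
  // so TextOutputFormat writes the lines as-is.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "lzo-output-example");
    job.setJarByClass(LzoOutputJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0); // map-only keeps the example small
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // The important part: LzopCodec, not LzoCodec.
    TextOutputFormat.setCompressOutput(job, true);
    TextOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
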
Once your hadoop-lzo Jar is on S3 and your Hadoop job has output .lzo files, run the indexer on the output directory (the instructions below assume you already have an EMR job flow/cluster running):

elastic-mapreduce -j <existingJobId> \
  --jar s3n://<yourBucketName>/hadoop-lzo-0.4.17-SNAPSHOT.jar \
  --args com.hadoop.compression.lzo.DistributedLzoIndexer \
  --args s3://<yourBucketName>/output/myLzoJobResults \
  --step-name "Lzo file indexer Jar"

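In the hadoop-lzo builds I used, DistributedLzoIndexer implements Hadoop's Tool interface, so you can also kick off the indexing from driver code instead of an EMR step. A minimal sketch, reusing the illustrative output path from above:

import org.apache.hadoop.util.ToolRunner;
import com.hadoop.compression.lzo.DistributedLzoIndexer;

// Launches the same distributed indexing job as the CLI step; it writes
// a .index file alongside each .lzo file it finds under the given path.
int exitCode = ToolRunner.run(new DistributedLzoIndexer(),
    new String[] { "s3://<yourBucketName>/output/myLzoJobResults" });
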
Then, when you use the data in a future job, be sure to specify that the input is in LZO format; otherwise the splitting won't occur. Example Java code:

import com.hadoop.mapreduce.LzoTextInputFormat;

// Uses the .index files to split each .lzo file into multiple input splits
job.setInputFormatClass(LzoTextInputFormat.class);
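
For completeness, a sketch of the input side wired against the indexed output; again, Job.getInstance assumes Hadoop 2.x and the S3 path is the illustrative one from the indexer step:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import com.hadoop.mapreduce.LzoTextInputFormat;

Job job = Job.getInstance(new Configuration(), "consume-lzo-example");
// Splittable reads only happen for .lzo files with a matching .index;
// an unindexed file falls back to a single split, losing parallelism.
job.setInputFormatClass(LzoTextInputFormat.class);
FileInputFormat.addInputPath(job,
    new Path("s3://<yourBucketName>/output/myLzoJobResults"));
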
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow