Question

I have a big file that is formatted as follows

sample name \t index \t score

And I'm trying to split this file based on sample name using Hadoop Streaming. I know ahead of time how many samples there are, so I can specify how many reducers I need. This post is doing something very similar, so I know that this is possible.

I tried using the following script to split this file into 16 files (there are 16 samples)

hadoop jar $STREAMING \
    -D mapred.text.key.partitioner.options=-k1,1 \
    -D stream.num.map.output.key.fields=2 \
    -D mapred.reduce.tasks=16 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -mapper cat \
    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
    -input input_dir/*part* -output output_dir

This somewhat works: some of the output files contain only one sample name. However, most of the part* files are blank, and some contain multiple sample names.

Is there a better way to make sure that every reducer gets only one sample name?

Solution

FYI, there is actually a much cleaner way to split up files: use a custom OutputFormat, which names output files by key rather than relying on the hash partitioner to send exactly one sample to each reducer (which it cannot guarantee).

This link describes how to do this really well. I ended up tailoring this other link for my specific application. Altogether, it's only a few extra lines of Java.
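As a rough illustration of what those links describe, here is a minimal sketch using the old `mapred` API's `MultipleTextOutputFormat`. The class name `SampleNameOutputFormat` and the assumption that the map output key is `sampleName\tindex` are mine, not from the linked posts:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to a file named after its sample name
// (the first tab-separated key field) instead of part-NNNNN files.
public class SampleNameOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String leafName) {
        // Key looks like "sampleName\tindex"; use the sample name as the file name.
        return key.toString().split("\t", 2)[0];
    }
}
```

With a class like this compiled into a jar, it can be passed to the streaming job via `-libjars` and `-outputformat SampleNameOutputFormat`, and each sample's records land in their own file regardless of how the partitioner distributes keys across reducers.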

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow