Question

I would like your opinion on Partitioner vs MultipleOutputs.
Suppose I have a file that contains keys such as:

0:aaa  
1:bbb  
0:ccc  
0:ddd  
...  
1:zzz  

I would like to have 2 files: one containing the keys starting with 0: and the other containing the keys starting with 1:. Which approach should I use?
1) Use a custom Partitioner which parses the key and returns 0 or 1 from getPartition().
2) Use MultipleOutputs.write in the reduce phase, parsing the key and passing zero or one as the namedOutput parameter of MultipleOutputs.write.

Which one is better? For me, 1) is better because reducers deal with a single file.


Solution

If your job only has to split the input files into 2 parts, then MultipleOutputs is the better bet, because you can run a map-only job and skip the shuffle / sort phase entirely.

Now if you have lots of input files and don't want twice as many output files as you have inputs, then the partitioner-based approach lets you consolidate the input files into 2 outputs. They won't be nicely named, which is another benefit of MultipleOutputs, but you can easily fix this by using MultipleOutputs in your reducer together with LazyOutputFormat, which ensures that the empty part-r files are not written as output.

So to answer - it depends on how many input files you have, and how many output files you want.
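
For the partitioner route (option 1 in the question), the core of the work is turning the key prefix into a partition number. Here is a minimal sketch of that logic in plain Java, with no Hadoop dependency so it can be run on its own. The class name `PrefixPartition` and the method `partitionFor` are hypothetical; in a real job this logic would sit inside the `getPartition(key, value, numPartitions)` method of a `Partitioner` subclass, and the sketch assumes every key starts with a non-negative integer followed by a colon.

```java
/** Hypothetical helper: logic a custom Partitioner's getPartition() could delegate to. */
final class PrefixPartition {
    private PrefixPartition() {}

    /**
     * Maps a key such as "0:aaa" to a partition index in [0, numPartitions).
     * Assumption: the text before the first ':' is an integer.
     */
    static int partitionFor(String key, int numPartitions) {
        int colon = key.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("Key has no numeric prefix: " + key);
        }
        int prefix = Integer.parseInt(key.substring(0, colon));
        // floorMod keeps the result inside the valid range even if
        // more prefixes appear than there are partitions.
        return Math.floorMod(prefix, numPartitions);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"0:aaa", "1:bbb", "0:ccc", "1:zzz"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 2));
        }
    }
}
```

With the number of reduce tasks set to 2, partition 0 ends up in the first reducer's output file and partition 1 in the second's, which is why the result is two consolidated files with the default part-r names.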

OTHER TIPS

When you say the first option is better, you are assuming the keys are bound to 2 values. If another key prefix shows up, you would need to change your partitioner or the configuration to set 3 reducers, so the better idea is to use MultipleOutputs.
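
To illustrate why deriving the output name from the key copes with new prefixes, here is a plain-Java simulation that groups keys by prefix in an in-memory map. It is only a sketch: `PrefixSplitter` and `splitByPrefix` are hypothetical names, the `"unknown"` bucket is an assumption for keys without a colon, and in a Hadoop job each map entry would instead correspond to an output written through `MultipleOutputs.write` with a name built from the prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for "one output per key prefix". */
final class PrefixSplitter {
    private PrefixSplitter() {}

    /** Groups keys by the text before the first ':'; one map entry per output. */
    static Map<String, List<String>> splitByPrefix(List<String> keys) {
        Map<String, List<String>> outputs = new TreeMap<>();
        for (String key : keys) {
            int colon = key.indexOf(':');
            // Assumption: keys without a prefix go to a catch-all output.
            String prefix = colon < 0 ? "unknown" : key.substring(0, colon);
            outputs.computeIfAbsent(prefix, p -> new ArrayList<>()).add(key);
        }
        return outputs;
    }

    public static void main(String[] args) {
        // A third prefix ("2") gets its own output without any code change.
        List<String> keys = Arrays.asList("0:aaa", "1:bbb", "2:ccc", "0:ddd");
        System.out.println(splitByPrefix(keys));
    }
}
```

The number of outputs follows the data here, whereas the partitioner approach needs the reducer count and the partitioning logic to agree in advance.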

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow