Question

I have a mapper that, while processing data, classifies the output into 3 different types (the type is the output key). My goal is to create 3 different CSV files via the reducers, each containing all of the data for one key and starting with a header row.
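
For concreteness, my mapper emits lines of the form key\tcsv-row, roughly like this (a sketch only; classify() and the field layout are placeholders, not my real code):

    #!/usr/bin/env python
    # mapper.py -- sketch; classify() stands in for my real logic
    import sys

    def classify(fields):
        # hypothetical: choose the output type from the record's contents
        return "csv1" if len(fields) == 4 else "csv2"

    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        key = classify(fields)
        # emit: <output type>\t<csv row>; the key determines which of the
        # 3 files this row should ultimately land in
        print("%s\t%s" % (key, ",".join(fields)))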

The key values can change and are text strings.

Now, ideally, I would like to have 3 different reducers, with each reducer getting only one key along with its entire list of values.
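
If that worked, each reducer could be as simple as the following (again a sketch; the header strings are made up), printing the header once and then every value it receives:

    #!/usr/bin/env python
    # reducer.py -- sketch; assumes this reducer sees exactly one key
    import sys

    # hypothetical header row for each output type
    HEADERS = {
        "csv1": "field1,field2,field3,field4",
        "csv2": "fieldA,fieldB,fieldC",
        "csv3": "fieldRed,fieldGreen",
    }

    current_key = None
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            # in the ideal one-key-per-reducer case this fires exactly
            # once, so the header ends up at the top of the output file
            print(HEADERS[key])
            current_key = key
        print(value)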

Except, this doesn't seem to work because the keys don't get mapped to specific reducers.

The answer to this in other places has been to write a custom partitioner class that would map each desired key value to a specific reducer. That would be great, except that I need to use streaming with Python, and I am not able to include a custom jar in my streaming job, so that seems not to be an option.

I see in the Hadoop docs that there is an alternate partitioner class available (KeyFieldBasedPartitioner) that can enable secondary sorts, but it isn't immediately obvious to me whether it is possible, using either the default partitioner or the key-field-based one, to ensure that each key ends up on its own reducer without writing a Java class and using a custom streaming jar.
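
For reference, the sort of invocation I have been experimenting with looks roughly like this (paths are placeholders; option spellings are from the streaming docs for my version and may differ on others):

    hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
        -D stream.num.map.output.key.fields=1 \
        -D mapred.text.key.partitioner.options=-k1,1 \
        -D mapred.reduce.tasks=3 \
        -input /user/me/input \
        -output /user/me/output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py -file reducer.py \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

As far as I can tell, though, KeyFieldBasedPartitioner still hashes the selected key fields to pick a reducer, so it seems to leave me with the same problem as the default.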

Any suggestions would be much appreciated.

Examples:

mapper output:

    csv2\tfieldA,fieldB,fieldC
    csv1\tfield1,field2,field3,field4
    csv3\tfieldRed,fieldGreen
    ...

The problem is that if I have 3 reducers, I end up with a key distribution like this:

reducer1        reducer2        reducer3
csv1            csv2
csv3

One reducer gets two different key types and one reducer gets no data sent to it at all. This is because hash("csv1") mod 3 and hash("csv3") mod 3 result in the same value.
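
As I understand it, the default partitioner is doing the equivalent of the following (a rough Python model; Hadoop actually computes Text.hashCode() over the key's bytes on the Java side, so the exact details may differ slightly):

    def text_hash(key):
        # 31-based rolling hash over the key's bytes, like Java's string
        # hashing, with 32-bit signed overflow
        h = 0
        for b in bytearray(key.encode("utf-8")):
            h = (31 * h + b) & 0xffffffff
        return h - 0x100000000 if h >= 0x80000000 else h

    def partition(key, num_reducers=3):
        # (hash & Integer.MAX_VALUE) % numReduceTasks
        return (text_hash(key) & 0x7fffffff) % num_reducers

    for key in ("csv1", "csv2", "csv3"):
        print("%s -> reducer %d" % (key, partition(key)))

Nothing in that scheme prevents two of my (changing) key strings from landing in the same bucket while another bucket stays empty, which is exactly what I am seeing.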

No correct solution
