Question

In a Hadoop cluster with 10 terabytes of data and 30 nodes, do we need a partitioner? If yes, why? Please support this with an example scenario: when do we need to implement a custom partitioner?

Solution

The Partitioner controls the partitioning of the keys of the intermediate map outputs; it decides which map output keys are sent to which reducer.

The default is HashPartitioner.

E.g., map output key/value pairs: [A,1], [A,3], [B,5], [B,1], [C,9]

With the default Partitioner, assuming there are 3 reducers:
Reducer 1 will get: [A,1], [A,3]
Reducer 2 will get: [B,5], [B,1]
Reducer 3 will get: [C,9]

So all the data in a single partition is processed by a single reducer. If there are fewer reducers (say two in the example above), [C,9] will also go to Reducer 1 or Reducer 2 (the partitioner's modulo behavior takes care of this).
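
For reference, Hadoop's built-in HashPartitioner picks the reducer by hashing the key and taking the result modulo the number of reduce tasks. A simplified sketch of that idea (not the exact Hadoop source) looks like this:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default behavior: the key's hash (made non-negative)
// modulo the number of reduce tasks decides the partition.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```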

A custom Partitioner is used to change this default behavior.
For example, you can decide that key/value pairs are sent to reducers as follows:
Reducer 1 should get: [A,1], [B,5], [C,9]
Reducer 2 should get: [B,1], [A,3]

There are many reasons why you would want to do this. For example:
- The reducer's processing logic groups keys by something custom to you rather than by keys that happen to hash to the same value, so you want to route those related keys to the same reducer.
- Each reducer creates its own output file (one per reducer), so you may want output files that each contain data relating only to certain keys.
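
A minimal sketch of such a custom Partitioner follows. It extends org.apache.hadoop.mapreduce.Partitioner and overrides getPartition; the class name and the routing rule (first letter of the key) are hypothetical and only illustrate the mechanism:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: route keys to reducers by a custom rule
// (the key's first letter) instead of the default hash.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 0) {
      return 0; // no reducers configured
    }
    String k = key.toString();
    char first = k.isEmpty() ? '_' : Character.toUpperCase(k.charAt(0));
    // Keys starting with A-M go to one bucket, everything else to another;
    // the modulo keeps the result valid for any reducer count.
    int bucket = (first >= 'A' && first <= 'M') ? 0 : 1;
    return bucket % numReduceTasks;
  }
}
```

In the job driver it would be wired up with something like the following (assuming a standard MapReduce driver with a Configuration already created):

```java
Job job = Job.getInstance(conf, "custom partitioner example");
job.setPartitionerClass(FirstLetterPartitioner.class);
job.setNumReduceTasks(2); // partitions map 1:1 to reducers
```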
