Question

For load balancing reasons, I want to create more partitions than reducers in a Hadoop environment. Is there a way to assign partitions to specific reducers, and if so, where can I define this? I wrote a custom Partitioner and now want to address specific reducers with specific partitions.

Thank you in advance for the help!

Solution

Hadoop doesn't lend itself to this kind of control.

As explained on pages 43-44 of this excellent book, the programmer has little control over:

  1. Where a mapper or reducer runs (i.e., on which node in the cluster).
  2. When a mapper or reducer begins or finishes.
  3. Which input key-value pairs are processed by a specific mapper.
  4. Which intermediate key-value pairs are processed by a specific reducer (the control you are asking for).

BUT

You can influence number 4 by implementing a cleverly designed custom Partitioner that splits your data exactly the way you want, distributing your load across reducers as expected. Check out how a custom partitioner is implemented to calculate relative frequencies in chapter 3.3; a minimal sketch of the idea follows.
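A minimal sketch of such a partitioner, assuming Text keys and IntWritable values; the class name LoadAwarePartitioner and the "hot_" prefix rule are illustrative assumptions, not the book's example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class LoadAwarePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Send known-heavy keys to a dedicated partition so they do not
        // share a reducer with the rest of the data.
        if (key.toString().startsWith("hot_")) {
            return 0;
        }
        // Spread all remaining keys evenly over the other partitions.
        // Assumes numPartitions >= 2.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}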

OTHER TIPS

The partitioning is done for the reducers: as many partitions are created as there are reducers. You can choose the number of reducers with

job.setNumReduceTasks(n);

The number n need not be limited by the number of physical reduce slots you have; the extra tasks will simply wait for the next free slot. In your partitioner code, you can implement the logic required to assign a key to a specific partition.
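A short job-configuration sketch showing how these pieces fit together, assuming the hypothetical LoadAwarePartitioner from above and an illustrative count of 20 reduce tasks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "load-balanced job");
        // Ask for more reduce tasks than there are physical slots.
        job.setNumReduceTasks(20);
        // Register the custom partitioner so keys are routed by our logic.
        job.setPartitionerClass(LoadAwarePartitioner.class);
        // ... set mapper, reducer, and input/output paths as usual ...
    }
}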

However, I do not see any efficiency gain from going far beyond the number of physically available reduce slots, as it will only result in tasks waiting for the next free slot.
