Hadoop doesn't lend itself to this kind of control. As explained on pp. 43-44 of this excellent book, the programmer has little control over:
- Where a mapper or reducer runs (i.e., on which node in the cluster).
- When a mapper or reducer begins or finishes.
- Which input key-value pairs are processed by a specific mapper.
- Which intermediate key-value pairs are processed by a specific reducer. (This is the one you want to control.)
BUT
You can influence number 4 by implementing a cleverly designed custom Partitioner
that splits your data the way you want and distributes the load across reducers as expected. Check out how a custom partitioner is used to calculate relative frequencies in chapter 3.3.
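A minimal sketch of the idea (plain Python rather than Hadoop's Java `Partitioner` API; the pair-shaped key and the crc32 hash are illustrative assumptions, not the book's exact code): hash only part of the intermediate key so that all related pairs land on the same reducer.

```python
from zlib import crc32

def partition(key, num_reducers):
    """Mimics the contract of Hadoop's Partitioner.getPartition:
    map an intermediate key to a reducer index in [0, num_reducers).
    Here the (hypothetical) key is a (left_word, right_word) pair and
    we hash only the left word, so every pair sharing a left word is
    routed to the same reducer -- the property the relative-frequency
    pattern in chapter 3.3 relies on."""
    left_word, _right_word = key
    return crc32(left_word.encode("utf-8")) % num_reducers

# All co-occurrence pairs for "dog" reach one reducer, which can then
# sum the marginal count for "dog" before emitting relative frequencies.
same_reducer = partition(("dog", "cat"), 4) == partition(("dog", "fish"), 4)
```

In real Hadoop you would subclass `Partitioner`, override `getPartition`, and register the class on the job; the routing logic is the same idea as above.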