Question

I recently had to run a job that required all the mappers to finish before their results were passed to the combine stage (because of the way the processed files were structured). For the reduce stage, this behavior can be forced with the following configuration:

// force 100% of the mappers to conclude before reducers start
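// (note: in Hadoop 2.x this property was renamed to
// mapreduce.job.reduce.slowstart.completedmaps)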
job.set("mapred.reduce.slowstart.completed.maps", "1.0");

I couldn't find any similar configuration for the combine stage. Eventually I split my job into two jobs: the combine logic acts as the reducer of job #1, and my original reduce logic moved to job #2, whose mapper simply passes the data through without modifying it (see the sketch below).
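Roughly, the driver for the two-job workaround looked like this. This is only a sketch: the class names (TwoStageDriver, Stage1Mapper, CombineAsReducer, Stage2Reducer) and the Text/Text key-value types are illustrative stand-ins, and the intermediate data is passed between the jobs as a SequenceFile so that job #2's identity mapper sees the same types job #1 emitted:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TwoStageDriver {

    // Stand-in for the real map logic.
    public static class Stage1Mapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(value, new Text(""));  // real map logic goes here
        }
    }

    // The "combine" logic promoted to a Reducer, so it is guaranteed to run
    // only after every mapper has finished.
    public static class CombineAsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, new Text(""));    // real combine logic goes here
        }
    }

    // Stand-in for the original reduce logic, now in job #2.
    public static class Stage2Reducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, new Text(""));    // real reduce logic goes here
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: real mapper + combine logic running as the reducer.
        Job job1 = Job.getInstance(conf, "stage 1: map + combine-as-reduce");
        job1.setJarByClass(TwoStageDriver.class);
        job1.setMapperClass(Stage1Mapper.class);
        job1.setReducerClass(CombineAsReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        Path intermediate = new Path(args[1]);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: the base Mapper class is an identity mapper, so records flow
        // through unchanged to the original reduce logic.
        Job job2 = Job.getInstance(conf, "stage 2: identity map + original reduce");
        job2.setJarByClass(TwoStageDriver.class);
        job2.setMapperClass(Mapper.class);
        job2.setReducerClass(Stage2Reducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, new Path(args[2]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}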

I was wondering: is there a configuration I missed that forces 100% map completion before the combiner runs? Thanks.

Solution

There is no way to control this: the combiner may or may not run for any given map instance; in fact, the combiner may run multiple times over the various spills of your map data.

There's a more detailed explanation in Tom White's book, "Hadoop: The Definitive Guide":

http://books.google.com/books?id=Nff49D7vnJcC&pg=PA178&lpg=PA178&dq=hadoop+combiner+spill&source=bl&ots=IiesWqctTu&sig=V5b3Z2EVWp5JzIvc_Fzv1-AJerI&hl=en&sa=X&ei=QUJwT9XBCOna0QGOzpnlBg&ved=0CFMQ6AEwAw#v=onepage&q=hadoop%20combiner%20spill&f=false

So your combiner may run before your map task even finishes.
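The practical consequence is that combiner logic must be safe to run zero, one, or many times over partial map output, which in practice means it should be commutative and associative. As a minimal illustration (the SumCombiner class below is mine, not from the question), a partial-sum combiner has this property, since re-combining partial sums never changes the final total:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Safe to run zero, one, or many times per map task: it emits partial sums,
// and partial sums can themselves be summed again without changing the result.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable out = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        out.set(sum);
        ctx.write(key, out);  // a partial sum, not necessarily the final total
    }
}

By contrast, logic that needs to see all of a key's values at once breaks under these semantics, which is exactly why the two-job split described in the question is the right approach when the combine step must wait for 100% of the mappers.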

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow