Question

I recently had to run a job that required all the mappers to finish before their results were passed to the combine stage (because of the way the processed files were structured). For the reduce stage, this behavior can be forced with the following configuration:

// force 100% of the mappers to conclude before reducers start
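// (note: in Hadoop 2.x this property was renamed to
// mapreduce.job.reduce.slowstart.completedmaps)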
job.set("mapred.reduce.slowstart.completed.maps", "1.0");

I couldn't find any similar configuration for the combine stage. Eventually I split my job into two jobs: the combine logic acts as the reducer of job #1, and my original reduce logic moved to job #2, whose mapper simply passes the data through without modifying it (see the sketch below).
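Roughly, the driver for the two-job workaround looked like this. This is only a sketch: the class names (TwoStageDriver, Stage1Mapper, CombineAsReducer, Stage2Reducer) and the Text/Text key-value types are illustrative stand-ins, and the intermediate data is passed between the jobs as a SequenceFile so that job #2's identity mapper sees the same types job #1 emitted:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TwoStageDriver {

    // Stand-in for the real map logic.
    public static class Stage1Mapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(value, new Text(""));  // real map logic goes here
        }
    }

    // The "combine" logic promoted to a Reducer, so it is guaranteed to run
    // only after every mapper has finished.
    public static class CombineAsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, new Text(""));    // real combine logic goes here
        }
    }

    // Stand-in for the original reduce logic, now in job #2.
    public static class Stage2Reducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, new Text(""));    // real reduce logic goes here
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: real mapper + combine logic running as the reducer.
        Job job1 = Job.getInstance(conf, "stage 1: map + combine-as-reduce");
        job1.setJarByClass(TwoStageDriver.class);
        job1.setMapperClass(Stage1Mapper.class);
        job1.setReducerClass(CombineAsReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(Text.class);
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        Path intermediate = new Path(args[1]);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: the base Mapper class is an identity mapper, so records flow
        // through unchanged to the original reduce logic.
        Job job2 = Job.getInstance(conf, "stage 2: identity map + original reduce");
        job2.setJarByClass(TwoStageDriver.class);
        job2.setMapperClass(Mapper.class);
        job2.setReducerClass(Stage2Reducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, new Path(args[2]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}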

I was wondering: is there a configuration I missed that forces 100% map completion before the combiner runs? Thanks.

Solution

There is no way to control this: the combiner may or may not run for any given map instance; in fact, the combiner may run multiple times over the various spills of your map data.

There's a more detailed explanation in Tom White's book, "Hadoop: The Definitive Guide":

http://books.google.com/books?id=Nff49D7vnJcC&pg=PA178&lpg=PA178&dq=hadoop+combiner+spill&source=bl&ots=IiesWqctTu&sig=V5b3Z2EVWp5JzIvc_Fzv1-AJerI&hl=en&sa=X&ei=QUJwT9XBCOna0QGOzpnlBg&ved=0CFMQ6AEwAw#v=onepage&q=hadoop%20combiner%20spill&f=false

So your combiner may run before your map task even finishes.
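The practical consequence is that combiner logic must be safe to run zero, one, or many times over partial map output, which in practice means it should be commutative and associative. As a minimal illustration (the SumCombiner class below is mine, not from the question), a partial-sum combiner has this property, since re-combining partial sums never changes the final total:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Safe to run zero, one, or many times per map task: it emits partial sums,
// and partial sums can themselves be summed again without changing the result.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable out = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        out.set(sum);
        ctx.write(key, out);  // a partial sum, not necessarily the final total
    }
}

By contrast, logic that needs to see all of a key's values at once breaks under these semantics, which is exactly why the two-job split described in the question is the right approach when the combine step must wait for 100% of the mappers.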

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow