Question

We have a nice, big, complicated Elastic MapReduce job that has wildly different hardware constraints for the Mapper vs. the Collector vs. the Reducer.

The issue is: for the Mappers, we need tonnes of lightweight machines to run several mappers in parallel (all good there); the Collectors are more memory-hungry, but it should still be OK to give them about 6GB of peak heap each... but the problem is the Reducers. When one of those kicks off, it will grab about 32-64GB for processing.

The result is that we get a round-robin type of task death, because the full memory of a box is consumed, which causes that one mapper and reducer to both be restarted elsewhere.

The simplest approach would be if we could somehow specify a way to have the reducers run on a different "group" (a handful of ginormous boxes) while the mappers/collectors run on smaller boxes. This could also lead to significant cost savings, as we really shouldn't be sizing the nodes the mappers run on to the demands of the reducers.

An alternative would be to "break up" the job so that a second cluster can be spun up to process the mappers'/collectors' output--but that's obviously "sub-optimal".

So, the questions are:

  • Is there a way to specify which "group" a mapper or a reducer will run on in Elastic MapReduce and/or Hadoop?
  • Is there a way to prevent the reducers from starting until all the mappers are done?
  • Does anyone have other ideas on how to approach this?

Cheers!


Solution

In a Hadoop MapReduce job, the reduce() phase only begins after all the Mappers have finished: the output of the Map phase is partitioned, shuffled, and sorted to decide which Reducer receives which data, and reducing starts only once that Shuffle/Sort phase has ended. Note, however, that reducer tasks may be launched earlier so they can start copying map output (the shuffle); how early is controlled by the "slow start" setting (mapreduce.job.reduce.slowstart.completedmaps in Hadoop 2.x, mapred.reduce.slowstart.completed.maps on older releases). Setting that fraction to 1.0 keeps reducers from being scheduled, and from tying up memory, until every map task has completed.
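Below is a minimal sketch of driver code that raises the slow-start threshold to 1.0, assuming the Hadoop 2.x mapreduce API; the job name and the commented-out mapper/reducer classes are placeholders for illustration, not anything from the question:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SlowStartDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Don't schedule any reduce task until 100% of the map tasks have
            // completed, so reducers can't grab memory while mappers still run.
            // (On Hadoop 1.x the property is mapred.reduce.slowstart.completed.maps.)
            conf.set("mapreduce.job.reduce.slowstart.completedmaps", "1.0");

            Job job = Job.getInstance(conf, "big-reduce-job");
            // job.setMapperClass(MyMapper.class);    // placeholder -- wire up
            // job.setReducerClass(MyReducer.class);  // your real job here
            // ... input/output formats and paths as usual ...

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The same property can also be supplied at submit time (e.g. -D mapreduce.job.reduce.slowstart.completedmaps=1.0 if the driver uses ToolRunner) or in the cluster's Hadoop configuration. Either way it only delays when reducer tasks are launched; the reduce() calls themselves never run before all maps finish.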

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow