Question

As we know, during the copy phase of Hadoop, each reduce worker pulls data from all of the mapper nodes, merges the already-sorted data (it was sorted in memory on the mapper side), and works on its share of keys and their values.

Now, we also know that all the data corresponding to a particular key will go to only one reducer.

My question is: how is the data split up for transfer to the reducers, i.e. how is the partition size decided, and by what process, given that the data is transferred using a pull mechanism rather than a push mechanism? An interesting challenge here would be determining the overall size of the data, since it resides on multiple nodes (I am guessing that the job tracker/master process may be aware of the size and location of the data on all the nodes, but I am not sure of that either).

Wouldn't it be a performance penalty in terms of parallel processing if the data is highly skewed and most of it belongs to a single key, while there are 10 or more reducers? In that case, only one reducer process would be processing most of the data, in a sequential fashion. Is this kind of situation handled in Hadoop? If yes, how?

Solution

How is the data split up for transfer to the reducers, i.e. how is the partition size decided, and by what process, given that the data is transferred using a pull mechanism rather than a push mechanism? An interesting challenge here would be determining the overall size of the data, since it resides on multiple nodes (I am guessing that the job tracker/master process may be aware of the size and location of the data on all the nodes, but I am not sure of that either).

Splitting of data into partitions is governed by the logic written inside getPartition(KEY key, VALUE value, int numPartitions), defined by the Partitioner abstract class. The default Hadoop partitioner is the HashPartitioner: it takes the hashCode() of the key and reduces it modulo the number of partitions (masking off the sign bit so the result is non-negative). You can write your own Partitioner if you find that the HashPartitioner is not efficient in your case.
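As a minimal sketch of what that override looks like against the new org.apache.hadoop.mapreduce API (the class name and the routing rule are hypothetical, chosen only to show the shape of the method):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: route keys by their first character instead of
// the default hashCode()-modulo scheme.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1 || key.getLength() == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        // Keys starting with a-m go to the first reducer, everything else
        // to the last one; a real partitioner would balance this better.
        return (first >= 'a' && first <= 'm') ? 0 : numPartitions - 1;
    }
}
```

You plug it in with job.setPartitionerClass(FirstLetterPartitioner.class) on the Job before submission; the contract is simply that getPartition returns a value in the range [0, numPartitions).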

As map tasks complete successfully, they notify their parent TaskTracker of the status update, which in turn notifies the JobTracker. These notifications are transmitted over the heartbeat communication mechanism. The reduce tasks learn from these map-completion events which nodes hold finished map output, which is what allows them to pull their partitions as soon as the output becomes available.

Wouldn't it be a performance penalty in terms of parallel processing if the data is highly skewed and most of it belongs to a single key, while there are 10 or more reducers? In that case, only one reducer process would be processing most of the data, in a sequential fashion. Is this kind of situation handled in Hadoop? If yes, how?

Yes, it is true. The MapReduce framework provides different types of Partitioner which you can choose according to your requirements:

  • HashPartitioner, the default partitioner
  • TotalOrderPartitioner, which provides a way to partition by range (see the sketch after this list)
  • KeyFieldBasedPartitioner, which provides a way to partition the data by parts of the key
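
To make the range option concrete, here is a minimal sketch of wiring up TotalOrderPartitioner with the new org.apache.hadoop.mapreduce API; the input path, partition-file path, and sampler parameters are illustrative assumptions, not values from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "total-order-sort");
        FileInputFormat.addInputPath(job, new Path("/data/input")); // illustrative path

        // Partition by key ranges instead of hash buckets, so each
        // reducer receives a contiguous, roughly equal-sized key range.
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // The partitioner reads its range boundaries from this file.
        TotalOrderPartitioner.setPartitionFile(
                job.getConfiguration(), new Path("/tmp/partitions.lst"));

        // Sample ~1% of the input (up to 10,000 records) to choose
        // boundaries that split the key space into balanced ranges.
        // Assumes the job's input format and key/value types are already
        // configured to match the data being sorted.
        InputSampler.writePartitionFile(
                job, new InputSampler.RandomSampler<>(0.01, 10000));
    }
}
```

Note that range partitioning balances key ranges across reducers; all occurrences of one particular key still land on a single reducer, since the one-key-one-reducer contract mentioned in the question always holds.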

If you are still unsatisfied, you can implement your own logic. See this if you need some help writing a custom partitioner.

HTH

P.S.: I didn't quite get the second part of your first question. Let me know if the answer is not clear or if you need any further clarification.
