Question

I am new to Hadoop. Since the data transfer between map nodes and reduce nodes can reduce the efficiency of MapReduce, why aren't map tasks and reduce tasks placed on the same node?


Solution 2

The amount of data to be processed is usually too large; that's why map and reduce run on separate nodes. If the amount of data to be processed is small, then you can certainly run map and reduce on the same node.

Hadoop is usually used when the amount of data is very large; in that case, separate nodes are needed for the map and reduce operations for the sake of availability and concurrency.

Hope this will clear your doubt.

OTHER TIPS

Actually, you can run map and reduce in the same JVM if the data is small enough. This is possible in Hadoop 2.0 (aka YARN), where it is called an uber task.

From the great book "Hadoop: The Definitive Guide":

If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. (This is different from MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or run as an uber task.

An uber job is one whose mappers and reducers are executed inside the ApplicationMaster itself. So, assuming the job to be executed has at most 9 mappers and at most 1 reducer (the default thresholds), the ResourceManager (RM) creates an ApplicationMaster, and the job runs entirely within the ApplicationMaster's own JVM.

SET mapreduce.job.ubertask.enable=TRUE;
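The SET line above is Hive shell syntax. In a plain MapReduce driver, the same property, together with the thresholds that define a "small" job, can be set through the Configuration API. Below is a minimal driver sketch, assuming the Hadoop 2.x APIs; the ubertask property names are the real ones from mapred-default.xml, but the job wiring is left skeletal:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class UberJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Let the ApplicationMaster run sufficiently small jobs
            // sequentially in its own JVM instead of asking the RM
            // for separate containers.
            conf.setBoolean("mapreduce.job.ubertask.enable", true);

            // Thresholds for what counts as "small"; the values shown
            // here are the Hadoop defaults: at most 9 map tasks and at
            // most 1 reduce task. (There is also
            // mapreduce.job.ubertask.maxbytes, which defaults to one
            // HDFS block of input.)
            conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
            conf.setInt("mapreduce.job.ubertask.maxreduces", 1);

            Job job = Job.getInstance(conf, "uber-example");
            // ... set mapper, reducer, input/output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With these settings, a job that falls under all three thresholds is uberized and runs sequentially inside the ApplicationMaster's JVM.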

So the advantage of an uberized job is that the round-trip overhead is eliminated: the ApplicationMaster no longer has to request containers for the job from the ResourceManager (RM) and wait for the RM to allocate them.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow