Question

I have a situation where all the blocks of a file lie on one machine and the default replication factor is 1.

In this scenario, if I start a Hadoop job on my cluster, I expect that all my map tasks would run only on that one machine, since the blocks are present only there. Is that right? Is local mapper task execution a constraint or just a priority?
If it is just a priority, is it possible to configure things so that mapper tasks also run on other machines, by copying the blocks to their local disks?

My second question: even if the mapper tasks run on only one machine, is it correct that reducers will be started on the other machines, copying the mappers' intermediate data?


Solution

Data-local execution is just a priority, not a constraint. Hadoop will spawn non-local mappers if there are free slots on other nodes. It can even launch duplicate attempts of the same task on different nodes at the same time - this is called speculative execution - and the attempt that finishes first wins, while the others are killed.
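Speculative execution can be toggled per task type. A minimal sketch of the relevant configuration, assuming the Hadoop 2.x (`mapreduce.*`) property names:

```xml
<!-- mapred-site.xml, or set per job on the Configuration object. -->
<!-- Assumes Hadoop 2.x property names; older releases used
     mapred.map.tasks.speculative.execution instead. -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value> <!-- allow duplicate attempts of slow map tasks -->
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value> <!-- likewise for reduce tasks -->
</property>
```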

As for reducers - they copy the map output over the network in a phase called the shuffle.

OTHER TIPS

The framework tries its best to keep the processing as local as possible, but there are cases where it can't. One is obviously slot unavailability. Another is when your InputSplit spans more than one block and each block resides on a different machine. In that case the remaining piece of the InputSplit is moved over the network to the node where the Mapper for this InputSplit was started, so that the entire split is processed by one Mapper.
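One way to avoid multi-block splits entirely is to cap the split size at the HDFS block size, so each Mapper reads exactly one block. A hedged sketch, assuming the standard FileInputFormat property and a 128 MB block size (adjust the value to match your `dfs.blocksize`):

```xml
<!-- Keep each InputSplit within a single HDFS block.
     134217728 bytes = 128 MB, the common default block size. -->
<property>
  <name>mapreduce.input.fileinputformat.split.maxsize</name>
  <value>134217728</value>
</property>
```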

Is local mapper task execution a constraint or just a priority?

It's not a constraint; it is just there to make things more efficient. Moving your big data across the network just to process it would be very inefficient - shipping the computation to the data is one of the fundamental principles of Hadoop.

If yes, is it possible to configure in such a way that mapper tasks also run on other machines by copying the blocks to their local disks?

Why would you do that? If you really want to run Mappers on multiple replicas of the same block, enable speculative execution instead of copying the block from one place to another yourself. This runs multiple attempts of the same Mapper on multiple machines, and you get the output from whichever attempt finishes first.
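If the goal is simply to let more nodes run data-local Mappers, raising the replication factor of the file is usually the cleaner fix - HDFS then places the extra block copies on other machines for you. A sketch using the standard `hdfs dfs` shell (the path is a placeholder):

```shell
# Raise the replication factor of an existing file to 3;
# -w waits until the extra replicas are actually written.
hdfs dfs -setrep -w 3 /user/me/input/bigfile.txt

# Verify: the replication column of -ls shows the new factor.
hdfs dfs -ls /user/me/input/bigfile.txt
```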

Second question: even if mapper tasks run on only one machine, is it correct that reducers will be started on the other machines, copying the mappers' intermediate data?

Reducers can start on any node with free slots, though not necessarily on all machines - the number of reducers is set by the job, not by where the data lives.
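To make that concrete, a minimal sketch of pinning the reducer count, assuming the Hadoop 2.x property name:

```xml
<!-- Run 4 reduce tasks; the scheduler places them on whichever
     nodes have free slots. Equivalent to calling
     job.setNumReduceTasks(4) in the job driver. -->
<property>
  <name>mapreduce.job.reduces</name>
  <value>4</value>
</property>
```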

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow