Question

I have started reading about Big Data and Hadoop, so this question may sound very stupid to you.

This is what I know.

Each mapper processes a small amount of data and produces an intermediate output. After this, we have the step of shuffle and sort.

Now, shuffle = moving the intermediate output over to the respective Reducers, each of which deals with a particular key or set of keys.

So, can one Data Node have both the Mapper and Reducer code running on it, or do we have different DNs for each?

Solution

  1. Terminology: Datanodes are for HDFS (storage). Mappers and Reducers (compute) run on nodes that have the TaskTracker daemon on them.

  2. The maximum number of mappers and reducers that can run concurrently on each tasktracker is controlled by the configs mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.

Subject to limits in other configs, as long as the tasktracker is running fewer than its maximum number of map or reduce tasks, the jobtracker may assign it more tasks of that type. Typically the jobtracker tries to place tasks so as to minimize the amount of data movement.

So, yes, you can have mappers and reducers running on the same node at the same time.
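
To make the slot accounting concrete, here is a minimal toy model in Python. It is only a sketch and not Hadoop's actual code: the TaskTracker class, its fields, and the try_assign method are hypothetical names made up for illustration, and the slot limits of 2 are illustrative values standing in for the two configs above.

```python
from dataclasses import dataclass, field


@dataclass
class TaskTracker:
    """Toy model of one worker node; hypothetical, not a Hadoop class."""
    name: str
    max_map_slots: int = 2      # stands in for mapred.tasktracker.map.tasks.maximum
    max_reduce_slots: int = 2   # stands in for mapred.tasktracker.reduce.tasks.maximum
    running_maps: list = field(default_factory=list)
    running_reduces: list = field(default_factory=list)

    def try_assign(self, kind: str, task_id: str) -> bool:
        """Accept the task only if a free slot of that kind exists."""
        if kind == "map":
            running, limit = self.running_maps, self.max_map_slots
        elif kind == "reduce":
            running, limit = self.running_reduces, self.max_reduce_slots
        else:
            raise ValueError(f"unknown task kind: {kind}")
        if len(running) >= limit:
            return False
        running.append(task_id)
        return True


tracker = TaskTracker("node-1")
results = [
    tracker.try_assign("map", "m0"),
    tracker.try_assign("map", "m1"),
    tracker.try_assign("map", "m2"),     # rejected: both map slots are busy
    tracker.try_assign("reduce", "r0"),  # accepted: reduce slots are counted separately
]
print(results)  # [True, True, False, True]
```

Because map and reduce slots are counted separately, the same node ends up running two map tasks and one reduce task at once, which is the situation the question asks about.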

OTHER TIPS

You can have both mappers and reducers running on the same node. As an example, consider a single-node Hadoop cluster. There, the entire HDFS storage (DataNode and NameNode) as well as the job tracker and the task tracker all run on the same node.

In this case both the mappers and reducers run on the same node.

Licensed under: CC-BY-SA with attribution