Controling and monitorying number of simultaneous map/reduce tasks in YARN

https://stackoverflow.com//questions/22069904

23-12-2019
|

Question

I have an Hadoop 2.2 cluster deployed on a small number of powerful machines. I have a constraint to use YARN as the framework, which I am not very familiar with.

How do I control the number of actual map and reduce tasks that will run in parallel? Each machine has many CPU cores (12-32) and enough RAM. I want to utilize them maximally.
How can I monitor that my settings actually led to a better utilization of the machine? Where can I check how many cores (threads, processes) were used during a given job?

Thanks in advance for helping me melt these machines :)

Solution

1.
In MR1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties dictated how many map and reduce slots each TaskTracker had.

These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, which control the amount of memory and CPU on each node, both available to both maps and reduces

Essentially:
YARN has no TaskTrackers, but just generic NodeManagers. Hence, there's no more Map slots and Reduce slots separation. Everything depends on the amount of memory in use/demanded

Using the web UI you can get lot of monitoring/admin kind of info:

NameNode - http://:50070/
Resource Manager - http://:8088/

In addition Apache Ambari is meant for this: http://ambari.apache.org/

And Hue for interfacing with the Hadoop/YARN cluster in many ways: http://gethue.com/

OTHER TIPS

There is a good guide on YARN configuration from Hortonworks
You may analyze your job in Job History server. It usually may be found on port 19888. Ambari and Ganglia are also very good for cluster utilization measurement.

I've the same problem, in order to increase the number of mappers, it's recommended to reduce the size of the input split (each input split is processed by a mapper and so a container). I don't know how to do it,

indeed, hadoop 2.2 /yarn does not take into account none of the following settings

<property>
    <name>mapreduce.input.fileinputformat.split.minsize</name>
    <value>1</value>
</property>
<property>
    <name>mapreduce.input.fileinputformat.split.maxsize</name>
    <value>16777216</value>
</property>

<property>
    <name>mapred.min.split.size</name>
    <value>1</value>
</property>
<property>
    <name>mapred.max.split.size</name>
    <value>16777216</value>
</property>

best

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow