Question

We have a very frustrating Hadoop Streaming memory problem. Our setup:

  • Hadoop 2.2.0 (YARN)
  • our compute nodes have about 7 GB of RAM
  • Hadoop Streaming starts a bash script which uses about 4 GB of RAM
  • therefore only one such task can run per node at a time

Out of the box, each Hadoop node starts about 7 containers with default settings. Each Hadoop task forks a bash script that needs about 4 GB of RAM; the first fork works, but all subsequent ones fail because they run out of memory. So what we are looking for is a way to limit the number of containers to exactly one per node. What we found on the internet:

  • yarn.scheduler.maximum-allocation-mb and mapreduce.map.memory.mb are set to values such that there is at most one container per node. This means mapreduce.map.memory.mb must be more than half of the node's maximum memory (otherwise there will be multiple containers); see the sketch right after this list.
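
A minimal sketch of that pattern, assuming the node offers roughly 7 GB to YARN; the concrete values (7168, 4096) are illustrative choices of ours, not figures from the question:

<!-- yarn-site.xml: how much memory the node offers, and the largest single allocation -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>7168</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>7168</value>
</property>

<!-- mapred-site.xml: request more than half the node per map task,
     so the scheduler can fit only one container per node -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>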

Done right, this gives us one container per node. But it produces a new problem: since our Java process now uses at least half of the node's memory, the child (bash) process we fork inherits the parent's memory footprint, and because the parent already used more than half of the total memory, we run out of memory again. If we lower the map memory, Hadoop allocates 2 containers per node, which run out of memory too.

We would be very happy for any help offered. Thanks!

Edit: since this problem is a blocker in our project, we are evaluating adapting the Hadoop source code to solve this issue.


Solution

It seems that the solution is to limit the initial memory footprint of the Hadoop child JVMs via:

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx512m</value>
</property>
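
For a streaming job, the same settings can also be passed per job with generic -D options; a sketch, where the jar path, HDFS paths, and script name are placeholder assumptions of ours:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -D mapreduce.map.memory.mb=6656 \
    -D mapreduce.map.java.opts=-Xmx512m \
    -input /data/in \
    -output /data/out \
    -mapper our_script.sh \
    -file our_script.sh

Note that the generic -D options have to come before the streaming-specific options.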

We also set the following parameter to the same heap value, just to be sure:

yarn.app.mapreduce.am.command-opts

which sets the heap size of the MapReduce Application Master process.
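
In mapred-site.xml that might look as follows, reusing the same 512 MB heap (the value itself is our assumption):

<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx512m</value>
</property>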

The number of Hadoop containers can be set via the pattern described above. Important to note: there must be at least the amount of memory given in mapreduce.map.java.opts left free to be able to spawn the child processes. We used:

mapreduce.map.memory.mb = yarn.scheduler.maximum-allocation-mb - mapreduce.map.java.opts
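
As a worked example with our assumed numbers (7168 MB maximum allocation, 512 MB child heap):

mapreduce.map.memory.mb = 7168 - 512 = 6656

6656 MB is still more than half of 7168 MB, so only one container fits per node, as required above.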

Everything works smoothly now. Hope this may help someone in the future!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow