Question

We have a couple of SGE clusters running various versions of RHEL at my work and we're testing a new one with a newer Redhat, all . On the old cluster ("Centos release 5.4"), I'm able to submit a job like the following one and it runs fine:

echo "java -Xms8G -Xmx8G -jar blah.jar ..." |qsub ... -l h_vmem=10G,virtual_free=10G ...

On the new cluster "CentOS release 6.2 (Final)", a job with those parameters fails due to running out of memory, and I have to change the h_vmem to h_vmem=17G in order for it to succeed. The new nodes have about 3x the RAM of the old node and in testing I'm only putting in a couple of jobs at a time.

On the old cluster, I'd set the -Xms/Xms to be N, I could use N+1 or so for the h_vmem. On the new cluster, I seem to be crashing unless I set h_vmem to be 2N+1.

I wrote a tiny perl script that all it does is progressively use consume more memory and periodically print out the memory used until it crashes or it reaches a limit. The h_vmem parameter makes it crash at the expected memory usage.

I've tried multiple versions of the JVM (1.6 and 1.7). If I omit the h_vmem, it works, but then things are riskier to run.

I have googled where others have seen similar issues, but no resolutions found.

Was it helpful?

Solution

The problem here appears to be an issue with the combination of the following factors:

  1. The old cluster was RHEL5, and the new RHEL6
  2. RHEL6 includes an update to glibc that changes the way MALLOC reports memory usage of multi-threaded programs.
  3. the JVM includes a Multi-threaded garbage collector by default

To fix the problem I've used a combination of the following:

  • Export the MALLOC_ARENA_MAX environment variable to a small number (1-10) e.g. in the job script. I.e. include something like: export MALLOC_ARENA_MAX=1
  • Moderately increased the SGE memory requests by 10% or so
  • Explicitly set the number of java GC threads to a low number by using java -XX:ParallelGCThreads=1 ...
  • Increased the SGE thread requests. E.g. qsub -pe pthreads 2

Note that it's unclear that setting the MALLOC_ARENA_MAX all the way down to 1 is the right number, but low numbers seem to work well from my testing.

Here are the links that lead me to these conclusions:

https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en

What would cause a java process to greatly exceed the Xmx or Xss limit?

http://siddhesh.in/journal/2012/10/24/malloc-per-thread-arenas-in-glibc/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top