The problem appears to come from a combination of the following factors:
- The old cluster ran RHEL5, and the new one runs RHEL6.
- RHEL6 ships a newer glibc whose malloc gives threads their own memory arenas, which changes (and typically inflates) the reported memory usage of multi-threaded programs.
- The JVM uses a multi-threaded garbage collector by default.
To fix the problem, I used a combination of the following:
- Export the MALLOC_ARENA_MAX environment variable with a small value (1-10), e.g. in the job script, with something like:
export MALLOC_ARENA_MAX=1
- Increase the SGE memory requests moderately, by 10% or so.
- Explicitly set the number of Java GC threads to a low number, using:
java -XX:ParallelGCThreads=1 ...
- Increase the SGE thread requests, e.g.:
qsub -pe pthreads 2
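Put together, a job script along these lines might look like the sketch below. This is only an illustration under assumptions: the h_vmem resource name, the 2200M figure, the -Xmx1g heap size, and myapp.jar are all hypothetical and will differ per site and per job. Only MALLOC_ARENA_MAX, -XX:ParallelGCThreads, and the pthreads parallel environment come from the steps above.

```shell
#!/bin/bash
# Sketch of an SGE job script; values and resource names are assumptions.
#$ -pe pthreads 2        # PE name from the qsub example; sites may differ
#$ -l h_vmem=2200M       # assumed resource name; a hypothetical 2000M raised ~10%

# Limit glibc to a single malloc arena for all processes started from here.
export MALLOC_ARENA_MAX=1

# Hypothetical heap size and jar name; only ParallelGCThreads comes from the text.
JAVA_CMD=(java -Xmx1g -XX:ParallelGCThreads=1 -jar myapp.jar)

# Print the command for the job log; uncomment the last line to run it.
echo "MALLOC_ARENA_MAX=${MALLOC_ARENA_MAX} ${JAVA_CMD[*]}"
# "${JAVA_CMD[@]}"
```

Keeping the export in the job script rather than in a login profile means the setting only affects the job itself, which makes it easier to compare runs with different values.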
Note that it's unclear whether setting MALLOC_ARENA_MAX all the way down to 1 is the right choice, but low values worked well in my testing.
Here are the links that led me to these conclusions:
What would cause a java process to greatly exceed the Xmx or Xss limit?
http://siddhesh.in/journal/2012/10/24/malloc-per-thread-arenas-in-glibc/