Question

We have a Java EE application with a heap size of several gigabytes on our production servers. Every once in a while, one of the servers stops responding to requests.

  • When the problem occurs, the GC log shows that the server is spending much or most of its time running GCs that take 8 to 10 seconds each (usually they take less than 1 second).
  • We never get any OutOfMemoryErrors.
  • The problem is not tied to a specific heap size - in fact, it occurs at different heap sizes, none of them anywhere near the configured maximum.
  • The problem is not tied to a particular interval, time of day, user load, or server node. It seems completely random.
  • Heap dumps, even ones taken from a server while it was showing the problem, did not reveal anything obviously wrong.
  • Restarting the production servers each day seems to reduce the likelihood of the problem occurring, but does not fix it.
  • If we don't restart our servers each day, there's a big chance that the problem will occur on one of our 8 production servers within one to three days.

How would you start diagnosing this?

Configuration

Our JAVA_OPTS are the following: -Xms8096m -Xmx8096m -XX:MaxPermSize=512M -Dsun.rmi.dgc.client.gcInterval=1800000 -Dsun.rmi.dgc.server.gcInterval=1800000 -XX:NewSize=150M -XX:+UseParNewGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc.log

$ java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)

$ uname -a
Linux myhostname 2.6.18-274.3.1.el5 #1 SMP Tue Sep 6 20:13:52 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/version
Linux version 2.6.18-274.3.1.el5 (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-51)) #1 SMP Tue Sep 6 20:13:52 EDT 2011

$ cat /etc/issue
CentOS release 5.7 (Final)
Kernel \r on an \m

$ cat /proc/meminfo|grep "MemTotal"
MemTotal:     16279356 kB

GC log

This is an example GC log snippet from when the problem occurred:

111036.554: [GC 111036.555: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111036.555: [Tenured: 3629252K->3647971K(5526912K), 8.7565190 secs] 5840068K->3647971K(8014016K), 8.7567840 secs]
111055.691: [GC 111055.691: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111055.691: [Tenured: 3647971K->3667529K(5526912K), 8.7876340 secs] 5858787K->3667529K(8014016K), 8.7878690 secs]
111071.037: [GC 111071.037: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111071.037: [Tenured: 3667529K->3692057K(5526912K), 8.7581830 secs] 5878345K->3692057K(8014016K), 8.7584210 secs]
111088.407: [GC 111088.407: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111088.407: [Tenured: 3692057K->3638194K(5526912K), 10.7072790 secs] 5902873K->3638194K(8014016K), 10.7074960 secs]
111110.238: [GC 111110.238: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111110.238: [Tenured: 3638194K->3654614K(5526912K), 8.8021440 secs] 5849010K->3654614K(8014016K), 8.8023860 secs]
111128.115: [GC 111128.115: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111128.115: [Tenured: 3654614K->3668670K(5526912K), 8.8451510 secs] 5865430K->3668670K(8014016K), 8.8453600 secs]
111161.684: [GC 111161.684: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111161.684: [Tenured: 3668670K->3684080K(5526912K), 8.8156740 secs] 5879486K->3684080K(8014016K), 8.8159260 secs]
111186.669: [GC 111186.669: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111186.669: [Tenured: 3684080K->3639333K(5526912K), 10.6025350 secs] 5894896K->3639333K(8014016K), 10.6030040 secs]
111208.692: [GC 111208.692: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111208.692: [Tenured: 3639333K->3657993K(5526912K), 8.7967920 secs] 5850149K->3657993K(8014016K), 8.7970090 secs]
111235.486: [GC 111235.487: [ParNew: 2210816K->2210816K(2487104K), 0.0000090 secs]111235.487: [Tenured: 3657993K->3676521K(5526912K), 8.8212340 secs] 5868809K->3676521K(8014016K), 8.8214930 secs]

Solution

Since you are running a very old version of Java (almost four years old) and appear to be getting into an error state, the first thing to try is a newer release such as Java 6 update 35. I suspect update 12 does not have Compressed Oops enabled by default; that option should save you some memory (and therefore some GC overhead).
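
If you do move to a newer Java 6 build, you can verify or force the option explicitly. This is a minimal sketch, assuming your build supports these flags (Compressed Oops from roughly 6u14, -XX:+PrintFlagsFinal from roughly 6u21):

$ java -XX:+UseCompressedOops -XX:+PrintFlagsFinal -version | grep UseCompressedOops

Alternatively, simply add -XX:+UseCompressedOops to the JAVA_OPTS shown above.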

OTHER TIPS

First of all: get a thread dump! Send kill -3 to the process and check the log output.
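
A minimal sketch of capturing the dump, assuming a standard JDK install (the process id is a placeholder; kill -3 writes the dump to the JVM's stdout/console log, jstack writes wherever you redirect it):

$ kill -3 <java-pid>
$ jstack <java-pid> > /tmp/threads.txt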

You can confirm that GC is the culprit by checking that it is the GC threads that are busy.
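
One hedged way to do that on Linux with standard tools: find the hottest native threads with top, convert the thread id to hex, and match it against the nid=0x... values in the thread dump; the busy ones should be the VM thread or GC worker threads.

$ top -H -p <java-pid>
$ printf '%x\n' <tid-from-top>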

You are using 8 GB for the JVM. How much RAM does the machine have? (I hope at least 12 GB.)
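
(Your /proc/meminfo output above shows about 16 GB, so that looks fine. For a quick check on any node:)

$ free -m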

The thread dump will also show you the Eden, From, To, and PermGen sizes. That may help you pinpoint which memory space is having the problem.
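
If your build's kill -3 output does not include the heap summary, jmap from the JDK prints the same generation sizes and usage; a hedged alternative:

$ jmap -heap <java-pid>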

It usually takes a bit of experimentation, but my guess is that you are seeing this problem every 30 minutes, right? (Your sun.rmi.dgc.*.gcInterval is set to 1800000 ms, i.e. 30 minutes.) Full GCs like those Tenured collections in your log, which do not appear to be caused by a full heap, are very likely triggered by System.gc(). Normally the log would mark them as (System), but because your VM is old I am not sure; the GC log format changes with every release. To rule that out, I strongly recommend adding "-XX:+DisableExplicitGC"; it also ignores the explicit GC calls that the RMI distributed GC (DGC) makes.
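
A sketch of the change, simply adding the flag to the options you already use (everything else unchanged):

JAVA_OPTS="-Xms8096m -Xmx8096m -XX:MaxPermSize=512M -XX:+DisableExplicitGC -Dsun.rmi.dgc.client.gcInterval=1800000 -Dsun.rmi.dgc.server.gcInterval=1800000 -XX:NewSize=150M -XX:+UseParNewGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc.log"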

The second possibility is that you have a memory leak or problem that you failed to see in the dump. The log lines show that the new generation always holds 2210816K, and the old generation is still at 3676521K at the end. If all of those 2210816K are live (which is what it looks like), it is impossible to promote them to Tenured, because the combined 5887337K would not fit in the 5526912K Tenured space. If the JVM keeps doing that, you might eventually get a "GC overhead limit exceeded" error. But in that case you would expect to see those roughly 6 GB of live objects in the heap when you take the dump.
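
If you suspect that roughly 6 GB really is live, a hedged way to check is a live-only histogram or heap dump (both jmap options exist in JDK 6; the :live suboption forces a full GC first, so expect a pause):

$ jmap -histo:live <java-pid> | head -n 30
$ jmap -dump:live,format=b,file=/tmp/heap-live.hprof <java-pid>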

Licensed under: CC-BY-SA with attribution