Question

On an IBM iSeries system, I have a Java program running - an application server with a web server component, all in-house developed. When running on the 32 bit or 64 bit J9 JVM (IBM Technology for Java) I have symptoms of a memory leak.

Note that no problems are seen running this software on the iSeries classic JVM, on multiple Sun/Oracle JVMs and on Linux JVMs. Heck, I routinely leave the identical software running for weeks at a time on my wife's entry-level laptop while I am working on my website - I can assure you if it was leaking memory it would be noticed on that thing.

If I just leave a plain-vanilla system running idle, with no applications configured (basically just the messaging system and a web server), the heap just continues to slowly grow, causing more memory to be allocated over time, with each GC cycle not quite collecting down to the previous level. The pattern is exactly the same for JVMs where there is no problem, except that on those the GC sweep always reduces the heap to its previous GC level.

(Image: graph of heap usage over successive GC cycles, slowly trending upward.)

But if I pull a JVM system dump at startup (after the system stabilizes) and subsequent dumps after the allocated heap has grown significantly, a differential comparison indicates that there are no more reachable objects after running for a week than there were at startup. The most recent dump, taken after a week, shows six additional classes loaded and a few objects clearly related to that. Thorough reviews of all the live objects have shown nothing that leaps out at me as unexpected.

I have tried the optimized-for-throughput and the generational-concurrent garbage collectors.
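
(For reference, those are selected on J9 with the -Xgcpolicy option; to the best of my recollection the invocations look roughly like this, with "MyServer" standing in for the real launch command:

    java -Xgcpolicy:optthruput MyServer     (optimized for throughput)
    java -Xgcpolicy:gencon MyServer         (generational concurrent)

Check the IBM documentation for the exact spellings on your release.)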

So according to the job's heap size, we appear to be leaking, and according to heap dumps, nothing is leaking.

There are no JNI methods being invoked (other than native code running as part of the core JVM), and it's definitely the heap which is growing - I can see that clearly in the IBM WRKJVMJOB information as well as reported using JMX beans in my console log file.

I cannot, so far, connect to the active JVM using JMX tools like JVisualVM because, although the listen socket is created when properly configured, the connection is rejected, apparently at a protocol level (the TCP/IP stack shows an accepted connection, but the JVM bounces it).
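
For reference, the configuration in question is the usual set of com.sun.management.jmxremote system properties; the port number is arbitrary and "MyServer" is a placeholder for the real launch command:

    java -Dcom.sun.management.jmxremote.port=9010 \
         -Dcom.sun.management.jmxremote.authenticate=false \
         -Dcom.sun.management.jmxremote.ssl=false \
         MyServer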

I am confounded, and at a loss as to where to go next.

EDIT: Just to clarify; these results are all with an uninstrumented JVM because I cannot get JMX access to this JVM (we are working on that with IBM).

EDIT 2011-11-16 19:27: I was able to pull a GC activity report over 1823 GC cycles which includes specific counts of Soft/Weak/PhantomReferences; there is no sign of runaway growth in those numbers. There is, however, significant growth in the small-object tenured space (the large-object tenured space is empty): it has grown from 9 MB to 36 MB.


Solution

Having eliminated some careless waste of memory (though not any leaks) in my program, and tuned the GC better for our workload, I have brought the runaway memory use down to a tolerable level.

However, in the process I demonstrated that the IBM J9 JVM used on the AS/400 (aka iSeries, System i, i5, et al.) has a 1336 byte/minute leak, totaling about 2 MB/day. I can observe this leak with a variety of programs, from a "one-line" test program all the way up to our application server.

The one-line test program is this:

public class ZMemoryLeak2 {

    // A synchronized static method holds the monitor on ZMemoryLeak2.class,
    // so the wait() below is legal; the program simply parks forever and
    // does no allocation of its own.
    public static synchronized void main(String... args) {
        try {
            ZMemoryLeak2.class.wait(0);
        } catch (InterruptedException thr) {
            System.exit(0);
        }
    }
}

And a separate test program that did nothing except monitor memory use via the JMX API showed conclusively that 1336 B was leaked at exactly one-minute intervals, never to be reclaimed (well, not reclaimed after two weeks of running). OP note: it was actually a slightly different amount on each variation of the JVM.
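
That monitor is not reproduced here, but conceptually it needs nothing more than the platform MemoryMXBean read over a remote JMX connection. A rough sketch, in which the host, port, and class name are placeholders rather than the actual tool:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ZMemoryMonitor {

    // Connects to the target JVM's JMX agent and samples heap usage once a
    // minute, printing one line per sample so growth can be charted later.
    public static void main(String... args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbsc = connector.getMBeanServerConnection();
        MemoryMXBean mem = ManagementFactory.newPlatformMXBeanProxy(
                mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
        while (true) {
            MemoryUsage heap = mem.getHeapMemoryUsage();
            System.out.println(System.currentTimeMillis()
                    + " used=" + heap.getUsed()
                    + " committed=" + heap.getCommitted());
            Thread.sleep(60000);
        }
    }
}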

Update 2012-04-02: This was accepted as a bug by IBM a few weeks ago; it was actually found and patched in Java 5 about the middle of last year, and the patch for Java 6 is expected to be available in the next week or two.

OTHER TIPS

Great question. Thought I'd turn some of my comments into an answer.

  1. You mention that an idle system grows in terms of memory. This is an important bit of information. Either there are internal scheduled jobs (automations, timers, etc.) or there is external process monitoring that is causing object churn. I would consider turning off the monitoring to see if the graphs are affected. This may help you figure out which objects are part of the problem.

  2. When the server is under load, I suspect there is a fair amount of object churn. Your ultimate problem may be that the IBM JVM is not handling memory fragmentation as well as the other JVMs -- I'd be surprised by this, though. I would work with IBM to try various other GC options and see how you can address it. I would also think this would be easy to simulate: write a test server that does a whole bunch of memory operations and see whether the memory usage grows over days. That might demonstrate that it is time to migrate away from IBM JVMs. Again, this would surprise me, but if what you say is true and neither the number nor the size of objects is growing...

  3. I would look at the graphs of the various memory sections. I suspect you are seeing the old-gen space go up and down and the survivor space trickle up steadily. If it is true that the number of objects is not changing, then @Stephen must be right about their internal size, or something else is at work. Maybe the object accounting is failing to report them all for some reason.

  4. I find that the GC button on the JMX memory tab does a more complete sweep. It should be equivalent to calling System.gc(), which you have tried. Just FYI.

  5. It would be good to turn on GC logging output to see if you can spot any patterns (a sample set of flags follows this list): http://christiansons.net/mike/blog/2008/12/java-garbage-collection-logging/ and http://java.sun.com/developer/technicalArticles/Programming/GCPortal/

  6. Any chance you can increase the transaction throughput on the server without changing the monitoring or internal automations? If the slope of the memory graphs changes, you know the growth is transaction-based; if not, your problem is elsewhere. Again, this is to help you find which objects may be causing problems.

Hope something in here is helpful.
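
To expand on tip 5: the precise flags vary between JVMs, but as a rough guide, HotSpot-style JVMs use -verbose:gc together with the PrintGC options, while the IBM J9 JVM can write its verbose GC log to a file with -Xverbosegclog ("MyServer" is a placeholder for the real launch command):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps MyServer    (HotSpot)
    java -Xverbosegclog:gc.log MyServer                                     (IBM J9; -verbose:gc alone sends the same output to stderr)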

One possible explanation is that you are seeing the build-up of objects in a cache implemented using WeakReference or similar; a minimal sketch of what I mean follows the list below. The scenario goes like this:

  • The GC cycles that you see in the graph are collections of the new space, and are not causing the references to be broken. So the cache is continuing to grow and use more heap space.

  • When you take a snapshot, this causes a full GC to be run which (maybe) breaks the references, and frees up the cached objects.

(Note the "maybe". I'm not sure that this explanation holds water ...)
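
For illustration only, this is the kind of cache meant here (hypothetical names, not code from the application in question). The byte[] values are only weakly reachable, so they linger on the heap until a collection actually clears the WeakReferences; until then the heap keeps growing even though nothing is truly leaked:

import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

public class ZWeakCache {

    // The map holds only WeakReferences, so the cached byte[] values remain
    // on the heap until the GC clears those references.
    private final Map<String, WeakReference<byte[]>> cache =
            new HashMap<String, WeakReference<byte[]>>();

    public byte[] lookup(String key) {
        WeakReference<byte[]> ref = cache.get(key);
        byte[] value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = new byte[1024];                     // stand-in for an expensive load
            cache.put(key, new WeakReference<byte[]>(value));
        }
        return value;
    }
}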


Another possible explanation is that your application has the same number of objects, but some of them are larger. For instance, you might have an array of some primitive type that you keep reallocating with a larger size, or a StringBuilder / StringBuffer that keeps growing, or (in some circumstances) an ArrayList or similar that keeps growing.
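
As a contrived illustration of that second point (hypothetical code, nothing from the application in question): the number of live objects stays essentially constant, but the retained size keeps growing.

public class ZGrowingBuffer {

    // Always the same two live objects (this builder plus its backing array),
    // but the backing array is reallocated ever larger and never shrinks.
    private final StringBuilder log = new StringBuilder();

    public void record(String event) {
        log.append(event).append('\n');
    }
}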


You know, you could be chasing a phantom here. It may be that the system dump is telling the truth and there is no storage leak at all. You could test that theory by reducing the heap size to a point where a real memory leak is likely to provoke an OOME relatively quickly. If I couldn't provoke an OOME that way, I'd be inclined to write this off as an interesting curiosity ... and move on to a real problem.
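
For example (the class name and sizes here are arbitrary), clamping the heap to something small should turn a genuine leak into an OutOfMemoryError within hours instead of weeks:

    java -Xms64m -Xmx64m MyServer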

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow