Question

I'm using the mstor library to parse an mbox mail file. Some of the files exceed a gigabyte in size. As you can imagine, this can cause some heap space issues.

There's a loop that, for each iteration, retrieves a particular message. The getMessage() call is what is trying to allocate heap space when it runs out. If I add a call to System.gc() at the top of this loop, the program parses the large files without error, but I realize that collecting garbage 40,000 times has to be slowing the program down.

My first attempt was to make the call look like if (i % 500 == 0) System.gc() so that it happens every 500 records. I tried raising and lowering this number, but the results are inconsistent and generally end in an OutOfMemoryError.

My second, more clever attempt looks like this:

try {
    message = inbox.getMessage(i);
} catch (OutOfMemoryError e) {
    if (firstTry) {
        i--;
        firstTry = false;
    } else {
        firstTry = true;
        System.out.println("Message " + i + " skipped.");
    }
    System.gc();
    continue;
}

The idea is to only call the garbage collector if an OutOfMemory error is thrown, and then decrement the count to try again. Unfortunately, after parsing several thousand e-mails the program just starts outputting:

 Message 7030 skipped.
 Message 7031 skipped.
 ....

and so on for the rest of them.

I'm just confused as to how hitting the collector for each iteration would return different results than this. From my understanding, garbage is garbage, and all this should be changing is how much is collected at a given time.

Can anyone explain this odd behavior? Does anyone have recommendations for other ways to call the collector less frequently? My heap space is maxed out.


Solution 5

The mstor library wasn't handling the caching of messages well. After doing some research I found that if you call Folder.close() (inbox is my Folder object above), mstor and JavaMail release all of the messages that were cached as a result of the getMessage() calls.

I made the try/catch block look like this:

try {
    message = inbox.getMessage(i);
    // moved all of my calls to message.getFrom(),
    // message.getAllRecipients(), etc. inside this try/catch.
} catch (OutOfMemoryError e) {
    if (firstTry) {
        i--;
        firstTry = false;
    } else {
        firstTry = true;
        System.out.println("Message " + i + " skipped.");
    }
    inbox.close(false);
    System.gc();
    inbox.open(Folder.READ_ONLY);
    continue;
}
firstTry = true;

Each time the catch block is hit, it takes 40-50 ms to manually clear the cached messages and re-open the folder.

When calling the garbage collector on every iteration, it took 57 minutes to parse a 1.6 gigabyte file. With this logic, it takes only 18 minutes to parse the same file.
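
The retry-once logic above can be separated from mstor so it can be tested on its own. The following is a minimal sketch, not mstor's or JavaMail's API: MessageStore, load and releaseCache are hypothetical names, with releaseCache standing in for the inbox.close(false) / inbox.open(Folder.READ_ONLY) pair.

```java
import java.util.ArrayList;
import java.util.List;

final class RetryOnceLoop {

    /** Hypothetical stand-in for the mstor folder; not part of mstor or JavaMail. */
    interface MessageStore {
        int count();
        String load(int index);   // may throw OutOfMemoryError, like getMessage(i)
        void releaseCache();      // stands in for close(false) followed by open(READ_ONLY)
    }

    /**
     * Loads messages 1..count() into {@code loaded}. After an OutOfMemoryError the
     * cache is released and the same index is tried once more; a second failure
     * skips that index. Returns the skipped indices.
     */
    static List<Integer> loadAll(MessageStore store, List<String> loaded) {
        List<Integer> skipped = new ArrayList<>();
        for (int i = 1; i <= store.count(); i++) {   // JavaMail message numbers start at 1
            boolean retried = false;
            while (true) {
                try {
                    loaded.add(store.load(i));
                    break;                            // success: move on to the next message
                } catch (OutOfMemoryError e) {
                    store.releaseCache();             // free cached messages before continuing
                    if (retried) {
                        skipped.add(i);               // second failure: give up on this one
                        break;
                    }
                    retried = true;                   // first failure: retry the same index
                }
            }
        }
        return skipped;
    }
}
```

The inner loop replaces the i-- / firstTry bookkeeping, so the retry state cannot leak from one message to the next. Catching OutOfMemoryError remains fragile in general; it only helps here because releasing the cache frees a large amount of memory.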

Update: another important factor in lowering the amount of memory used by mstor is its cache properties. Somebody else already mentioned setting "mstor.cache.disabled" to true, and this helped. Today I discovered another property that greatly reduced the number of OutOfMemoryError catches, even for larger files.

    Properties props = new Properties();
    props.setProperty("mstor.mbox.metadataStrategy", "none");
    props.setProperty("mstor.cache.disabled", "true");
    props.setProperty("mstor.mbox.cacheBuffers", "false");   // most important

OTHER TIPS

You should not rely on System.gc(), as the VM can ignore it. If you get an OutOfMemoryError, the VM has already tried to run GC. You can try increasing the heap size, changing the sizes of the heap generations (if, say, most of your objects end up in the old generation, you don't need much memory for the young generation), and reviewing your code to make sure you are not holding references to resources you no longer need.

Calling System.gc() is generally a waste of time: it doesn't guarantee that anything happens at any particular time. It is a suggestion at best, and the JVM is free to ignore it. Calling it after an OutOfMemoryError is even more useless, because the JVM has already tried to reclaim memory before the error was thrown.

The only thing you can do if you are using third-party code you can't control is to increase the JVM heap allocation on the command line to the most that your particular machine can handle.
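
To confirm that a larger heap setting actually took effect, the JVM can report its own limits. This is a minimal sketch using only java.lang.Runtime; the class name HeapReport is made up for illustration, and the reported maximum may be slightly below the value passed on the command line.

```java
final class HeapReport {

    /** Formats a byte count from Runtime as mebibytes. */
    static String describe(long bytes) {
        if (bytes == Long.MAX_VALUE) {
            return "no fixed limit";   // maxMemory() returns this when there is no inherent limit
        }
        return (bytes / (1024L * 1024L)) + " MiB";
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println("max heap the JVM will use: " + describe(rt.maxMemory()));
        System.out.println("currently reserved:        " + describe(rt.totalMemory()));
        System.out.println("free within reserved:      " + describe(rt.freeMemory()));
    }
}
```

Running it as java -Xmx2g HeapReport and again without the option shows whether the setting reaches the JVM that runs your parser.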

Get started with Java JVM memory (heap, stack, -Xss, -Xms, -Xmx, -Xmn...)

Here are my suggestions:

  • Increase the heap space. This is probably the easiest thing to do. You can do this with the -Xmx parameter (for example, -Xmx2g).
  • See if the API to load messages provides a "streaming" option. Perhaps you don't need to load the entire message into memory at once.
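
As an illustration of the streaming idea in the second suggestion, here is a minimal sketch that scans an mbox source one line at a time, so memory use stays roughly constant regardless of file size. It is a hand-rolled reader, not mstor's API; the class and method names are hypothetical, and it assumes body lines starting with "From " are escaped as ">From ", as the mboxo and mboxrd variants do.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class MboxStreamScan {

    /** Counts messages by their "From " separator lines without buffering message bodies. */
    static int countMessages(Reader source) {
        int count = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each message starts with a separator line beginning with "From ".
                if (line.startsWith("From ")) {
                    count++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

For a real file, a reader from Files.newBufferedReader(path, StandardCharsets.ISO_8859_1) could be passed in; ISO-8859-1 is suggested only because it decodes any byte sequence without throwing. The same loop could hand each message's lines to a parser one message at a time instead of just counting.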

Calling System.gc() won't do you any good because it doesn't guarantee that a collection will actually run. In fact, relying on it is a sure sign of bad code. If you're depending on System.gc() for your code to work, then your code is probably broken. In this case you seem to be relying on it for performance's sake, and that is a sign that your code is definitely broken.

You can never be sure that the JVM will honor your request, and you can't tell how it will perform the garbage collection either. Since the behavior of System.gc() is not guaranteed, it is better not to use it at all.

Finally, explicit calls to System.gc() can be disabled with the -XX:+DisableExplicitGC option. So again, your System.gc() call is not guaranteed to run, because the code might be running on a JVM that has been configured to ignore it.
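
If you want to know whether the JVM you are running on was started with that switch, the startup arguments can be inspected. This is a minimal sketch for HotSpot-style flags; the class name is made up, and flags supplied by other means (an options file, for instance) may not appear in this list.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class ExplicitGcCheck {

    /** Returns true if the last DisableExplicitGC switch in the arguments turns it on. */
    static boolean explicitGcDisabled(List<String> jvmArgs) {
        boolean disabled = false;   // the flag is off by default on HotSpot
        for (String arg : jvmArgs) {
            if (arg.equals("-XX:+DisableExplicitGC")) {
                disabled = true;
            } else if (arg.equals("-XX:-DisableExplicitGC")) {
                disabled = false;   // a later switch overrides an earlier one
            }
        }
        return disabled;
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("System.gc() disabled by flag: " + explicitGcDisabled(jvmArgs));
    }
}
```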

By default, mstor caches messages retrieved from a folder in an Ehcache cache for faster access. This caching can be disabled, however, and I would recommend disabling it for large folders.

You can disable caching by creating a text file called 'mstor.properties' in the root of your classpath with the following content:

mstor.cache.disabled=true

You can also set this value as a system property:

java -Dmstor.cache.disabled=true SomeProgram
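
The same system property can also be set from code rather than on the command line. This is a minimal sketch with a made-up class name; it assumes the property is set before mstor reads its configuration, which I have not verified against mstor's source.

```java
final class MstorCacheConfig {

    /** Sets the same key that -Dmstor.cache.disabled=true would set. */
    static void disableCache() {
        System.setProperty("mstor.cache.disabled", "true");
    }

    public static void main(String[] args) {
        disableCache();   // call this before any mstor class is used (assumption, see above)
        System.out.println("mstor.cache.disabled=" + System.getProperty("mstor.cache.disabled"));
    }
}
```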
Licensed under: CC-BY-SA with attribution