The best alternative for String flyweight implementation in Java

https://stackoverflow.com/questions/2915511

04-10-2019
|

Question

My application is multithreaded with intensive String processing. We are experiencing excessive memory consumption and profiling has demonstrated that this is due to String data. I think that memory consumption would benefit greatly from using some kind of flyweight pattern implementation or even cache (I know for sure that Strings are often duplicated, although I don't have any hard data in that regard).

I have looked at Java Constant Pool and String.intern, but it seems that it can provoke some PermGen problems.

What would be the best alternative for implementing application-wide, multithreaded pool of Strings in java?

EDIT: Also see my previous, related question: How does java implement flyweight pattern for string under the hood?

Solution

Note: This answer uses examples that might not be relevant in modern runtime JVM libraries. In particular, the substring example is no longer an issue in OpenJDK/Oracle 7+.

I know it goes against what people often tell you, but sometimes explicitly creating new String instances can be a significant way to reduce your memory.

Because Strings are immutable, several methods leverage that fact and share the backing character array to save memory. However, occasionally this can actually increase the memory by preventing garbage collection of unused parts of those arrays.

For example, assume you were parsing the message IDs of a log file to extract warning IDs. Your code would look something like this:

//Format:
//ID: [WARNING|ERROR|DEBUG] Message...
String testLine = "5AB729: WARNING Some really really really long message";

Matcher matcher = Pattern.compile("([A-Z0-9]*): WARNING.*").matcher(testLine);
if ( matcher.matches() ) {
    String id = matcher.group(1);
        //...do something with id...
}

But look at the data actually being stored:

    //...
    String id = matcher.group(1);
    Field valueField = String.class.getDeclaredField("value");
    valueField.setAccessible(true);

    char[] data = ((char[])valueField.get(id));
    System.out.println("Actual data stored for string \"" + id + "\": " + Arrays.toString(data) );

It's the whole test line, because the matcher just wraps a new String instance around the same character data. Compare the results when you replace String id = matcher.group(1); with String id = new String(matcher.group(1));.

OTHER TIPS

This is already done at the JVM level. You only need to ensure that you aren't creating new Strings everytime, either explicitly or implicitly.

I.e. don't do:

String s1 = new String("foo");
String s2 = new String("foo");

This would create two instances in the heap. Rather do so:

String s1 = "foo";
String s2 = "foo";

This will create one instance in the heap and both will refer the same (as evidence, s1 == s2 will return true here).

Also don't use += to concatenate strings (in a loop):

String s = "";
for (/* some loop condition */) {
    s += "new";
}

The += implicitly creates a new String in the heap everytime. Rather do so

StringBuilder sb = new StringBuilder();
for (/* some loop condition */) {
    sb.append("new");
}
String s = sb.toString();

If you can, rather use StringBuilder or its synchronized brother StringBuffer instead of String for "intensive String processing". It offers useful methods for exactly those purposes, such as append(), insert(), delete(), etc. Also see its javadoc.

Effeciently pack Strings in memory! I once wrote a hyper memory efficient Set class, where Strings were stored as a tree. If a leaf was reached by traversing the letters, the entry was contained in the set. Fast to work with, too, and ideal to store a large dictionary.

And don't forget that Strings are often the largest part in memory in nearly every app I profiled, so don't care for them if you need them.

Illustration:

You have 3 Strings: Beer, Beans and Blood. You can create a tree structure like this:

B
+-e
  +-er
  +-ans
+-lood

Very efficient for e.g. a list of street names, this is obviously most reasonable with a fixed dictionary, because insert cannot be done efficiently. In fact the structure should be created once, then serialized and afterwards just loaded.

Java 7/8

If you are doing what the accepted answer says and using Java 7 or newer you are not doing what it says you are.

The implementation of subString() has changed.

Never write code that relies on an implementation that can change drastically and might make things worse if you are relying on the old behavior.

1950    public String substring(int beginIndex, int endIndex) {
1951        if (beginIndex < 0) {
1952            throw new StringIndexOutOfBoundsException(beginIndex);
1953        }
1954        if (endIndex > count) {
1955            throw new StringIndexOutOfBoundsException(endIndex);
1956        }
1957        if (beginIndex > endIndex) {
1958            throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
1959        }
1960        return ((beginIndex == 0) && (endIndex == count)) ? this :
1961            new String(offset + beginIndex, endIndex - beginIndex, value);
1962    }

So if you use the accepted answer with Java 7 or newer you are creating twice as much memory usage and garbage that needs to be collected.

First, decide how much your application and developers would suffer if you eliminated some of that parsing. A faster application does you no good if you double your employee turnover rate in the process! I think based on your question we can assume you passed this test already.

Second, if you can't eliminate creating an object, then your next goal should be to ensure it doesn't survive Eden collection. And parse-lookup can solve that problem. However, a cache "implemented properly" (I disagree with that basic premise, but I won't bore you with the attendant rant) usually brings thread contention. You'd be replacing one kind of memory pressure for another.

There's a variation of the parse-lookup idiom that suffers less from the sort of collateral damage you usually get from full-on caching, and that's a simple precalculated lookup table (see also "memoization"). The Pattern you usually see for this is the Type Safe Enumeration (TSE). With the TSE, you parse the String, pass it to the TSE to retrieve the associated enumerated type, and then you throw the String away.

Is the text you're processing free-form, or does the input have to follow a rigid specification? If a lot of your text renders down to a fixed set of possible values, then a TSE could help you here, and serves a greater master: Adding context/semantics to your information at the point of creation, instead of at the point of use.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow