Question

If I have a large object graph that contains many duplicate strings, is there a benefit to intern()ing the strings before serializing them? Will this reduce the amount of data transferred? Will the strings share references on the receiving end?

My guess is that the strings would be de-duplicated before sending, reducing the size of the data, and that they would all be represented by the same object on the receiving end, but that they would not actually be interned there (meaning one new instance of each string would be created per serialization 'transaction').


Solution

ObjectOutputStream keeps track of the objects it has written (until reset() is called): each object is written only once, even if it is reached through multiple references. Reducing the number of distinct String instances by interning will therefore reduce the number of bytes written.

On the receiving end, the same object graph is recreated, so one string instance on the sending end becomes one string instance on the receiving end.
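
To illustrate both points, here is a minimal round-trip sketch (the class and variable names are illustrative, not from the question): when both fields hold the same String instance, the stream writes the characters once and a back-reference the second time, and the shared identity survives deserialization.

import java.io.*;

public class SharedReferenceDemo {
    // Hypothetical holder with two String fields.
    static class Pair implements Serializable {
        String a;
        String b;
        Pair(String a, String b) { this.a = a; this.b = b; }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        String s = new StringBuilder("hello").append(" world").toString();

        // Both fields point at the same String instance, so the character data
        // is written only once, followed by a back-reference.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(new Pair(s, s));
        oos.close();

        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray()));
        Pair copy = (Pair) ois.readObject();
        ois.close();

        // The shared reference is recreated on the receiving end: prints true.
        System.out.println(copy.a == copy.b);
    }
}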

OTHER TIPS

It's easy enough to test:

import java.io.*;

class Foo implements Serializable {
    private String x;
    private String y;

    public Foo(String x, String y) {
        this.x = x;
        this.y = y;
    }
}

public class Test {
    public static void main(String[] args) throws IOException {
        // Two equal but distinct String instances; the StringBuilder keeps x out of the literal pool.
        String x = new StringBuilder("hello").append(" world").toString();
        String y = "hello world";

        showSerializedSize(new Foo(x, y)); // two distinct instances
        showSerializedSize(new Foo(x, x)); // the same instance referenced twice
    }

    private static void showSerializedSize(Foo foo) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(foo);
        oos.close();
        System.out.println(baos.size());
    }
}

Results on my machine:

86
77

So it looks like the de-duplication doesn't happen automatically - equal but distinct String instances are each written in full.

I wouldn't use String.intern() itself though, as you probably don't want all of these strings in the normal intern pool - but you could always use a HashSet<String> to create a "temporary" intern pool.
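
A minimal sketch of such a temporary pool (using a HashMap rather than a HashSet so the canonical instance can be looked up again; the class and method names are made up):

import java.util.HashMap;
import java.util.Map;

// A throwaway intern pool: canonicalize() returns one shared instance per distinct value.
// Unlike String.intern(), the whole pool becomes garbage as soon as you drop it.
class TemporaryInternPool {
    private final Map<String, String> pool = new HashMap<String, String>();

    public String canonicalize(String s) {
        String canonical = pool.get(s);
        if (canonical == null) {
            pool.put(s, s);
            canonical = s;
        }
        return canonical;
    }
}

You would then run each String through pool.canonicalize(...) while building (or walking) the object graph before calling writeObject(), so duplicates collapse to a single instance and are written only once.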

You can use this enhancement of ObjectOutputStream, which implements de-duplication of Strings. The output should be compatible with the original format (not tested), so no special ObjectInputStream is required.

Note that String.intern() is not used; instead a private, temporary internal Map is used, so your PermGen space is not flooded.

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;

public class StringPooledObjectOutputStream extends ObjectOutputStream {
    private final Map<String, String> stringPool = new HashMap<String, String>();

    public StringPooledObjectOutputStream(OutputStream out) throws IOException {
        super(out);
        enableReplaceObject(true);
    }

    @Override
    protected Object replaceObject(Object obj) throws IOException {
        if (!(obj instanceof String))
            return super.replaceObject(obj);

        String str = (String) obj;

        // Map each distinct value to one canonical instance; repeated values are
        // then written as back-references by the underlying stream.
        String replacedStr = stringPool.get(str);
        if (replacedStr == null) {
            replacedStr = (String) super.replaceObject(str);
            stringPool.put(replacedStr, replacedStr);
        }
        return replacedStr;
    }
}
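
Usage could look like this (a sketch; foo stands for any Serializable object you want to write):

// Inside any method that serializes data:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ObjectOutputStream oos = new StringPooledObjectOutputStream(baos);
oos.writeObject(foo);
oos.close();
// A plain ObjectInputStream can read baos.toByteArray(), because replaceObject()
// only ever substitutes one String for an equal String.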

Before serializing, there does not seem to be any benefit to interning the strings - at least it will not change anything for the serialization itself. It may help reduce the memory footprint of your application.

On the receiving side, at the lowest level, readUTF() of ObjectInputStream (or its equivalent) will be called, and it allocates a new string for each call. If your class is Externalizable you can call readUTF().intern() to save memory on the receiving side. I have used this method myself and got more than a 50% reduction in the memory usage of the client application.

However, note that if there are lots of unique strings then intern() may cause an out-of-memory problem, as it uses PermGen. See: http://www.onkarjoshi.com/blog/213/6-things-to-remember-about-saving-memory-with-the-string-intern-method/

I only interned strings that were smaller than 10 characters and haven't faced any problems.
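
A hedged sketch of how that could look for an Externalizable class (the class and field names are made up; the 10-character cut-off mirrors the figure mentioned above):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Record implements Externalizable {
    private String code;    // short, highly repetitive value - worth interning
    private String payload; // potentially long or unique value - left alone

    public Record() { }  // public no-arg constructor required by Externalizable

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(code);
        out.writeUTF(payload);
    }

    public void readExternal(ObjectInput in) throws IOException {
        String c = in.readUTF();
        // Only intern short strings to avoid flooding the intern pool (PermGen on older JVMs).
        this.code = c.length() < 10 ? c.intern() : c;
        this.payload = in.readUTF();
    }
}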

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow