Storing object-graphs with class-evolution in Java with transformation (long time archiving)

https://softwareengineering.stackexchange.com/questions/218767

01-10-2020

Question

Abstract

A common problem is to store objects (including their graph) and to load them back. This is easy as long as the stored object representation matches the executing code. But as time goes by, requirements change and the stored objects no longer match the code. The stored objects should not lose their data, and the client code should work with the latest object models.

So a transformation has to occur somewhere between loading the data and returning an object to the client.

I know there are libraries such as XStream, Gson, protobuf and Avro. They can load older objects, but as far as I know they simply ignore data that no longer matches the fields of the class (maybe I missed something).
(When I talk about storing and serialization, I do not mean Java's built-in serialization mechanism.)

Question

So what is the question? I have been researching this for some time, and there seems to be no library that addresses this issue (class evolution) without data loss. I hope to find a working solution here, or an idea of how to implement it myself.

I have some requirements:

File-based - I want to be able to store the serialized object on disk
Appendable - I want to append multiple objects to one file without loading the whole file in memory again and again
Support for multiple versions in one file - A file could contain objects of different versions (but only of the same type)
Transformation - Data should be accessible by using the same type, even when changed in between.
Generic - The mechanism itself has to be generic, so I could use it for different objects (different objects do not get mixed in one file, only different versions of one type).

It would be nice if the storing format is human-readable.

Already stored objects cannot be updated (at least not without huge effort). Think of a long-term archive.
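To make the file-based, appendable and multi-version requirements concrete, here is a minimal sketch of one possible layout: one record per line, each prefixed with the version it was written with. The names VersionedArchive and VersionedRecord and the tab-separated line format are assumptions made for illustration, not part of any existing library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** One stored record: the version it was written with, plus its raw payload. */
final class VersionedRecord {
    final int version;
    final String payload;

    VersionedRecord(int version, String payload) {
        this.version = version;
        this.payload = payload;
    }
}

/** Hypothetical append-only archive: one record per line, "<version>\t<payload>". */
final class VersionedArchive {
    private final Path file;

    VersionedArchive(Path file) {
        this.file = file;
    }

    /** Appends one record without reading the existing file content. */
    void append(int version, String payload) throws IOException {
        if (payload.indexOf('\n') >= 0 || payload.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("payload must be a single line");
        }
        String line = version + "\t" + payload + "\n";
        Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Reads the file line by line; every line carries its own version. */
    List<VersionedRecord> readAll() throws IOException {
        List<VersionedRecord> records = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                int version = Integer.parseInt(line.substring(0, tab));
                records.add(new VersionedRecord(version, line.substring(tab + 1)));
            }
        }
        return records;
    }
}
```

The payload could be a single-line JSON document, which keeps the file human-readable. Because each line names its version, old and new records of the same type can sit next to each other in one file.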

Example

Here is an example for better illustration. Let's suppose we have two POJOs that we want to serialize.

public class MyPojo {
    String text;
    Long number;
    Integer[] values;
    SubPojo pojo;
}

public class SubPojo {
    List<String> items;
}

In the next version we might have renamed a field (text -> content), changed a type (Integer[] -> List<Integer>), and turned a single field into a list (SubPojo -> List<SubPojo>), where the previous value becomes the first element of the new list (not losing data, just transforming it into the new representation).

public class MyPojo {
    String content;
    Long number;
    List<Integer> values;
    List<SubPojo> pojo;
}

public class SubPojo {
    List<String> items;
}

Some pseudocode showing how the client might make use of this:

// Write
Serializer ser = new Serializer();
MyPojo pojo = new MyPojo();
pojo.xxx = ...; // set fields
ser.store(pojo, file, append);

// Read (a version later)
Serializer ser = new Serializer();
ser.registerTransformer(new TransformerV1ToV2());
List<MyPojo> pojos = ser.load(file);

This approach has some drawbacks:

The transformers have to work on some kind of intermediate format (this could be the underlying storage format, such as JSON or XML)
You don't know what the class looked like at a given point in time, since you only transform relative to the previous version and map to the final class, which makes tracking down errors hard
Performance (depending on how the transformation happens)

Solution 2

I decided to use the transformation approach. JSON is used as the intermediate format; this way I can (optionally) verify correctness between versions using JSON Schema.

Conceptually, you can think of it in two ways: as an object mapper, or as a JSON format that is expressed in Java objects.
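As a rough illustration of the transformation approach (not verjson's actual API), the sketch below chains per-version steps over a generic tree. A Map<String, Object> stands in for the parsed JSON document so the example needs no JSON library; the Transformer interface and the Migrator class are assumptions made for this sketch, while TransformerV1ToV2 borrows its name from the pseudocode in the question. The Integer[] -> List<Integer> change needs no step here, since both are plain arrays at the JSON level.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical step that upgrades a generic tree from one version to the next. */
interface Transformer {
    int sourceVersion();

    Map<String, Object> transform(Map<String, Object> node);
}

/** V1 -> V2 for MyPojo: "text" becomes "content", the single "pojo" becomes a list. */
final class TransformerV1ToV2 implements Transformer {
    @Override
    public int sourceVersion() {
        return 1;
    }

    @Override
    public Map<String, Object> transform(Map<String, Object> node) {
        // Work on a copy so the stored representation is left untouched.
        Map<String, Object> result = new LinkedHashMap<>(node);
        result.put("content", result.remove("text"));
        Object single = result.remove("pojo");
        List<Object> wrapped = new ArrayList<>();
        if (single != null) {
            wrapped.add(single); // the old value becomes the first list element
        }
        result.put("pojo", wrapped);
        return result;
    }
}

/** Applies the registered steps in order until the tree reaches the current version. */
final class Migrator {
    private final Map<Integer, Transformer> steps = new TreeMap<>();
    private final int currentVersion;

    Migrator(int currentVersion) {
        this.currentVersion = currentVersion;
    }

    void register(Transformer transformer) {
        steps.put(transformer.sourceVersion(), transformer);
    }

    Map<String, Object> migrate(int storedVersion, Map<String, Object> node) {
        Map<String, Object> current = node;
        for (int v = storedVersion; v < currentVersion; v++) {
            Transformer step = steps.get(v);
            if (step == null) {
                throw new IllegalStateException("No transformer registered for version " + v);
            }
            current = step.transform(current);
        }
        return current;
    }
}
```

Only after the tree has reached the current version would it be mapped onto the current MyPojo class, so the client code never sees an old layout.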

I open-sourced my solution at http://github.com/galan/verjson

The documentation will be extended soon. I also added support for custom serializers/deserializers and polymorphic types.

OTHER TIPS

I use the Externalizable interface to solve this problem. It doesn't really meet your Appendable requirement, but it might get you started.

Externalizable lets you write each object out yourself. What I do is include a version number for each class. Then, when I change the class, I adjust the readExternal method so it can read the new format as well as the old ones, and change the writeExternal method to write the new format with the new version number. The code that reads old formats may need to change to fill in new fields (and to stop filling in removed fields). Then I'm ready to go.

There's usually some sort of object hierarchy in the graph, so if a level 2 class is seriously modified, the new write/read for that class can output/input something completely different from the old one, with completely new lower-level classes.

Some care is needed because sometimes there isn't a hierarchy. You have to remember that, while reading, you can't set a field by pulling data from objects held in other fields, because they may not really be there yet. (Sometimes I add a setUp method to all my classes that gets called after the top-level readObject call, so the objects in the graph can get themselves organized once the whole graph has been read.)

And it sometimes helps to write out objects manually using a method that just writes plain bytes. (But on the read side you then have to know precisely what you are reading, whereas readExternal will read a complete object of any class.)

Code looks something like:

// version 3 -- November 16, 2013
// version 2 -- March 22, 2013
// version 1 -- April 1, 2012
public void writeExternal( ObjectOutput out )  throws IOException  {
    out.writeShort( 3 );              // always write the current version first
    out.writeLong( longData );
    out.writeObject( something );
    ....
}

public void readExternal( ObjectInput in )  throws IOException,
             ClassNotFoundException  {
    short version = in.readShort();
    if (version > 3)  {
        // Admit program is too old to read file.
        throw new IOException( "Unsupported version: " + version );
    }
    else if (version == 3)  {
        longData = in.readLong();
        something = in.readObject();  // cast to the field's declared type if needed
    }
    else if (version == 2)  {
        longData = 5;                 // not stored in version 2, so fill in a default
        something = in.readObject();
    }
    else
       ...
}
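For a complete round trip, the following self-contained sketch applies the same idea to a hypothetical Reading class whose version 2 added a unit field. The class, its fields, the "unknown" default and the versionToWrite knob are all assumptions made for this example; the knob exists only so the demo can produce an old-format stream without keeping old code around.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

/** Hypothetical class; version 1 stored only "value", version 2 added "unit". */
class Reading implements Externalizable {
    static final short CURRENT_VERSION = 2;
    /** Demo-only knob to emit an old-format stream; real code always writes CURRENT_VERSION. */
    static short versionToWrite = CURRENT_VERSION;

    long value;
    String unit;

    /** Externalizable requires a public no-argument constructor. */
    public Reading() {
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeShort(versionToWrite);
        out.writeLong(value);
        if (versionToWrite >= 2) {
            out.writeObject(unit);
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        short version = in.readShort();
        if (version > CURRENT_VERSION) {
            throw new IOException("Stream version " + version + " is newer than this program");
        }
        value = in.readLong();
        // Old records have no unit, so fill in a default instead of failing.
        unit = (version >= 2) ? (String) in.readObject() : "unknown";
    }

    static byte[] toBytes(Reading reading) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(reading);
        }
        return buffer.toByteArray();
    }

    static Reading fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Reading) in.readObject();
        }
    }
}
```

Reading back bytes written with versionToWrite set to 1 yields an object whose unit is the default, while bytes written with the current version restore the stored unit.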

I would start by creating a class of general-purpose serializable objects that are capable of standing in for any object that needs to be serialized. Each object would contain a representation of every field that needs to be serialized, and the type and version of the object. You may want to use JSON or something similar for this part.

Each serializable class would need a constructor that takes a general-purpose object and throws an exception if the type, version, or fields don't match what the class needs. For the other direction each class would need a method that converts an object of that class into a general-purpose object. If you want your serialization to include multiple references to a single object then you can pass around an additional object to take care of mapping between POJOs and general-purpose objects.

Finally each serializable class needs a static method for converting from old versions to the current version. For the first version this would do nothing, but every time you modify the class you would increment the current version and add a conversion step that converts a general-purpose object for the previous version into a general-purpose object for the new version. You could have a switch statement with no breaks that switches on the version of the given object. The first case would convert an object of the first version into an object of the second version, followed by a case for converting the second version into the third version, and then you can fall all the way through until you have an object of the current version at the bottom. Each new version just requires adding a single case to the bottom of the switch statement that represents the changes that have been made.
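The three pieces described above (a general-purpose object, a validating constructor, and a fall-through upgrade method) could look roughly like the sketch below. GenericObject, Person and its three versions are hypothetical names chosen for illustration; a real implementation would probably back the general-purpose object with JSON.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical general-purpose stand-in: a type name, a version, and named fields. */
final class GenericObject {
    final String type;
    int version;
    final Map<String, Object> fields = new HashMap<>();

    GenericObject(String type, int version) {
        this.type = type;
        this.version = version;
    }
}

/** Hypothetical class: v1 {name, email}, v2 {fullName, email}, v3 {fullName, emails}. */
final class Person {
    static final int CURRENT_VERSION = 3;

    final String fullName;
    final List<String> emails = new ArrayList<>();

    /** Accepts only a current-version object with the expected fields. */
    Person(GenericObject g) {
        if (!"Person".equals(g.type) || g.version != CURRENT_VERSION) {
            throw new IllegalArgumentException("Expected Person v" + CURRENT_VERSION);
        }
        Object name = g.fields.get("fullName");
        Object mails = g.fields.get("emails");
        if (!(name instanceof String) || !(mails instanceof List)) {
            throw new IllegalArgumentException("Missing or mistyped fields");
        }
        this.fullName = (String) name;
        for (Object mail : (List<?>) mails) {
            this.emails.add((String) mail);
        }
    }

    /** Converts this object back into its general-purpose form. */
    GenericObject toGeneric() {
        GenericObject g = new GenericObject("Person", CURRENT_VERSION);
        g.fields.put("fullName", fullName);
        g.fields.put("emails", new ArrayList<>(emails));
        return g;
    }

    /** Upgrades g in place, entering the switch at its stored version and falling through. */
    static GenericObject upgrade(GenericObject g) {
        switch (g.version) {
            case 1: // v1 -> v2: "name" was renamed to "fullName"
                g.fields.put("fullName", g.fields.remove("name"));
                // fall through
            case 2: { // v2 -> v3: the single "email" became the list "emails"
                List<Object> mails = new ArrayList<>();
                Object mail = g.fields.remove("email");
                if (mail != null) {
                    mails.add(mail);
                }
                g.fields.put("emails", mails);
            }
                // fall through
            case CURRENT_VERSION:
                g.version = CURRENT_VERSION;
                return g;
            default:
                throw new IllegalArgumentException("Unsupported version " + g.version);
        }
    }
}
```

A stored object would be loaded with new Person(Person.upgrade(generic)). A future version 4 would add one more case directly above case CURRENT_VERSION and bump the constant; the existing cases stay untouched.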

Licensed under: CC-BY-SA with attribution
