Update data only by difference between files (delta for java)

Question 1

So, what I ended up doing was using this:

https://code.google.com/p/xdeltaencoder/

From my test it works really really well. However, you will need to make sure to checksum the source (in my case fileAJson), as it does not do it automatically for you!

Anyways, code below:

//Create delta
String[] deltaArgs = new String[]{fileAJson.getAbsolutePath(), fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);

//Apply delta
deltaArgs = new String[]{"-d", fileAJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);

//Trivia, Surpisingly this also works
deltaArgs = new String[]{"-d", fileBJson.getAbsolutePath(), fileDelta.getAbsolutePath(), fileBTarget.getAbsolutePath()};
XDeltaEncoder.main(deltaArgs);

Question 2

Using a collection of "change events" rather than sending the whole array

A solution to this would be to send a serialized object describing the change rather than the actual array all over again.

public class ChangePair implements Serializable{
    //glorified struct
    public final int index;
    public final  byte newValue;

    public ChangePair(int index, byte newValue) {
        this.index = index;
        this.newValue = newValue;
    }

    public static void main(String[] args){

        Collection<ChangePair> changes=new HashSet<ChangePair>();

        changes.add(new ChangePair(12,(byte)2));
        changes.add(new ChangePair(1206,(byte)3));

    }
}

Generating the "change events"

The most efficient method for achieving this would be to track changes as you go, but assuming thats not possible you can just brute force your way through, finding which values are different

public static Collection<ChangePair> generateChangeCollection(byte[] oldValues, byte[] newValues){
    //validation
    if (oldValues.length!=newValues.length){
        throw new RuntimeException("new and old arrays are differing lengths");
    }

    Collection<ChangePair> changes=new HashSet<ChangePair>();

    for(int i=0;i<oldValues.length;i++){
        if (oldValues[i]!=newValues[i]){
            //generate a change event
            changes.add(new ChangePair(i,newValues[i]));
        }
    }

    return changes;
}

Sending and recieving those change events

As per this answer regarding sending serialized objects over the internet you could then send your object using the following code

Collection<ChangePair> changes=generateChangeCollection(oldValues,newValues);

Socket s = new Socket("yourhostname", 1234);
ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
out.writeObject(objectToSend);
out.flush();

On the other end you would recieve the object

ServerSocket server = new ServerSocket(1234);
Socket s = server.accept();
ObjectInputStream in = new ObjectInputStream(s.getInputStream());
Collection<ChangePair> objectReceived = (Collection<ChangePair>) in.readObject();
//use Collection<ChangePair> to apply changes

Using those change events

This collection can then simply be used to modify the array of bytes on the other end

public static void useChangeCollection(byte[] oldValues, Collection<ChangePair> changeEvents){
    for(ChangePair changePair:changeEvents){
        oldValues[changePair.index]=changePair.newValue;
    }
}

Question 3

Locally log the changes to the byte array, like a little version control system. In fact you could use a VCS to create patch files, send them to the other side and apply them to get an up-to-date file;

If you cannot log changes, you would need to double the array locally, or (not so 100% safe) use an array of checksums on blocks.

Question 4

The main problem here is data compression.

Kamikaze offers you good compression algorithms for data arrays. It uses Simple16 and PForDelta coding. Simple16 is a good and (as the name says) simple list compression option. Or you can use Run Lenght Encoding. Or you can experiment with any compression algorithm you have available in Java...

Anyway, any method you use will be optimized if you first preprocess the data.

You can reduce the data calculating differences or, as @RichardTingle pointed, creating pairs of different data locations.

You can calculate C as B - A. A will have to be an int array, since the difference between two byte values can be higher than 255. You can then restore B as A + C.

The advantage of combining at least two methods here is that you get much better results.

E.g. if you use the difference method with A = { 1, 2, 3, 4, 5, 6, 7 } and B = { 1, 2, 3, 5, 6, 7, 7 }. The difference array C will be { 0, 0, 0, 1, 1, 1, 0 }. RLE can compress C in a very effective way, since it is good for compressing data when you have many repeated numbers in sequence.

Using the difference method with Simple16 will be good if your data changes in almost every position, but the difference between values is small. It can compress an array of 28 single-bit values (0 or 1) or an array of 14 two-bit values to a single 32-byte integer.

Experiment, it all will depend on how your data behaves. And compare the data compression ratios for each experiment.

EDIT: You will have to preprocess the data before JSON and zip compressing.

Create two sets old and now. The latter contains all files that exists now. For the former, the old files, you have at least two options:

Should contain all files that existed before you sent them to the other PC. You will need to keep a set of what the other PC knows to calculate what has changed since the last synchronization, and send only the new data.
Contains all files since you last checked for changes. You can keep a local history of changes and give each version an "id". Then, when you sync, you send the "version id" together with the changed data to the other PC. Next time, the other PC first sends its "version id" (or you keed the "version id" of each PC locally), then you can send the other PC all the new changes (all the versions that come after the one that PC had).

The changes can be represented by two other sets: newFiles, and deleted files. (What about files that changed in content? Don't you need to sync these too?) The newFiles contains the ones that only exist in set now (and do not exist in old). The deleted set contains the files that only exist in set old (and do not exist in now).

If you represent each file as an String with the full pathname, you safely will have unique representations of each file. Or you can use java.io.File.

After you reduced your changes to newFiles and deleted files set, you can convert them to JSON, zip and do anything else to serialize and compress the data.