Question

When versioning or optimizing file backups, one idea is to store only the delta, i.e. the data that has actually been modified.

This sounds like a simple idea at first, but determining where unmodified data ends and new data begins turns out to be a difficult task.

Is there an existing framework that already does something like this, or an efficient file-comparison algorithm?


Solution

XDelta is not Java, but it is worth looking at anyway. There is a Java port of it, but I don't know how stable it is.
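If the Java port turns out to be too immature, you can always shell out to the native xdelta3 binary. A minimal sketch, assuming xdelta3 is installed and on the PATH (the file names are just placeholders):

    import java.io.IOException;

    public class XDeltaDemo {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Encode: produce a delta that transforms old.bin into new.bin.
            new ProcessBuilder("xdelta3", "-e", "-s", "old.bin", "new.bin", "delta.xd3")
                    .inheritIO().start().waitFor();

            // Decode: reconstruct the new file from old.bin plus the stored delta.
            new ProcessBuilder("xdelta3", "-d", "-s", "old.bin", "delta.xd3", "restored.bin")
                    .inheritIO().start().waitFor();
        }
    }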

OTHER TIPS

Instead of rolling your own, you might consider leveraging an open source version control system (e.g., Subversion). You get a lot more than just a delta versioning algorithm that way.

It sounds like you are describing a difference-based storage scheme. Most source control systems use such schemes to minimize their storage requirements, and the *nix "diff" command can generate the data you would need to implement one on your own.
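In a backup setting, that means keeping one full base copy plus the patches against it. A rough sketch of driving diff and patch from Java (the file names are placeholders, and it assumes the standard diff/patch tools are available):

    import java.io.File;
    import java.io.IOException;

    public class DiffStorageDemo {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Generate a unified diff between the stored base version and the
            // current version; the patch file is all you need to keep.
            new ProcessBuilder("diff", "-u", "base.txt", "current.txt")
                    .redirectOutput(new File("changes.patch"))
                    .start().waitFor();

            // Reconstruct the current version by patching a copy of the base.
            new ProcessBuilder("patch", "base-copy.txt", "changes.patch")
                    .inheritIO().start().waitFor();
        }
    }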

Here's a Java library that can compute diffs between two plain text files:

http://code.google.com/p/google-diff-match-patch/
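For what it's worth, producing and applying a textual delta with it looks roughly like this. A minimal sketch, assuming the library's diff_match_patch class is on the classpath (the sample strings are made up):

    import name.fraser.neil.plaintext.diff_match_patch; // the library's package, if I recall correctly

    import java.util.LinkedList;

    public class DmpDemo {
        public static void main(String[] args) {
            diff_match_patch dmp = new diff_match_patch();
            String oldText = "The quick brown fox";
            String newText = "The quick red fox jumps";

            // Build patches that transform oldText into newText.
            LinkedList<diff_match_patch.Patch> patches = dmp.patch_make(oldText, newText);

            // Serialise them -- this text is the "delta" you would store.
            String delta = dmp.patch_toText(patches);
            System.out.println(delta);

            // Later: reconstruct newText from oldText plus the patches.
            Object[] result = dmp.patch_apply(patches, oldText);
            System.out.println(result[0]); // "The quick red fox jumps"
        }
    }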

I don't know of any library for binary diffs, though. Try googling for 'java binary diff' ;-)

In my opinion, the bsdiff tool is the best choice for binary files. It uses suffix sorting (Larsson and Sadakane's qsufsort) and takes advantage of the way executable files typically change. bsdiff was written in C by Colin Percival. The diff files created by bsdiff are generally smaller than those created by XDelta.

It is also worth noting that bsdiff uses the bzip2 compression algorithm. Binary patches created by bsdiff can sometimes be compressed further with other compression algorithms (such as the one used by the WinRAR archiver).

Here is the site where you can find the bsdiff documentation and download bsdiff for free: http://www.daemonology.net/bsdiff/
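Command-line usage is simply bsdiff <old> <new> <patch> to create a patch and bspatch <old> <new> <patch> to apply it. A minimal sketch of driving the tools from Java, assuming both are on the PATH (the file names are placeholders):

    import java.io.IOException;

    public class BsdiffDemo {
        public static void main(String[] args) throws IOException, InterruptedException {
            // bsdiff <old> <new> <patch>: writes a bzip2-compressed binary patch.
            new ProcessBuilder("bsdiff", "app-v1.bin", "app-v2.bin", "v1-to-v2.patch")
                    .inheritIO().start().waitFor();

            // bspatch <old> <new> <patch>: rebuilds the new file from old + patch.
            new ProcessBuilder("bspatch", "app-v1.bin", "restored-v2.bin", "v1-to-v2.patch")
                    .inheritIO().start().waitFor();
        }
    }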

Licensed under: CC-BY-SA with attribution