Question

I need to write a tool in C++ to determine the changed bits in a file compared against another file for replication. What would be the best method of accomplishing this?

I don't have a specific OS or library in mind, I'm open to suggestions. My primary goal is reducing the amount of network traffic involved in replicating.

Solution

Look at rsync. It splits the file into blocks, calculates a checksum for each block, and exchanges only the checksums to determine which blocks differ from the destination's copy. Block data is then transmitted only where necessary.
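To make the block-checksum idea concrete, here is a minimal C++ sketch. It is not rsync's actual algorithm: it only compares blocks at fixed, aligned offsets, so an insertion shifts every later block, which rsync avoids by using a rolling checksum. The function names (`block_checksum`, `checksums`, `changed_blocks`) are made up for illustration, and FNV-1a is only a stand-in for a real checksum pair.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a over one block. Stand-in only; a real tool would use a
// stronger hash (and rsync pairs a weak rolling sum with a strong one).
std::uint64_t block_checksum(const std::string& data,
                             std::size_t offset, std::size_t len) {
    std::uint64_t h = 14695981039346656037ULL;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= static_cast<unsigned char>(data[offset + i]);
        h *= 1099511628211ULL;
    }
    return h;
}

// One checksum per fixed-size block; the last block may be shorter.
std::vector<std::uint64_t> checksums(const std::string& data,
                                     std::size_t block_size) {
    assert(block_size > 0);
    std::vector<std::uint64_t> out;
    for (std::size_t off = 0; off < data.size(); off += block_size) {
        std::size_t len = std::min(block_size, data.size() - off);
        out.push_back(block_checksum(data, off, len));
    }
    return out;
}

// Indices of blocks in `source` whose checksum differs from (or is
// missing in) the destination's list. Only these need to be sent.
// The new file length must be sent separately so the destination
// can truncate if the file shrank.
std::vector<std::size_t> changed_blocks(
        const std::string& source,
        const std::vector<std::uint64_t>& dest_sums,
        std::size_t block_size) {
    std::vector<std::uint64_t> src_sums = checksums(source, block_size);
    std::vector<std::size_t> changed;
    for (std::size_t i = 0; i < src_sums.size(); ++i) {
        if (i >= dest_sums.size() || src_sums[i] != dest_sums[i]) {
            changed.push_back(i);
        }
    }
    return changed;
}
```

In this sketch the destination would call `checksums` on its copy and send the list across; the source then calls `changed_blocks` and transmits only the listed blocks.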

OTHER TIPS

If you can't use rsync as is, check librsync. It's old, but the code is easy to read and improve.

If you don't have the old and new versions of files on the same machine, then rsync-like algorithms are the way forward (see previous answers). If you do have both the old and the new versions of files on the same machine, you can then do better than rsync: generate compressed diffs and send them over the network.

For generating efficient diffs, have a look at VCDIFF (RFC 3284) binary delta compression. One good implementation is xdelta (www.xdelta.org). It's fairly easy to implement a decoder/decompressor if you want to avoid using xdelta on the receiving end because of license issues. Writing your own VCDIFF diff generator that will generate compact diffs is much more complicated (think searching for moved blocks as an example).

In VCDIFF the diffs can also be sourceless, meaning they decompress into the target file without any source file (the file to which a diff is applied) at hand -- in VCDIFF compressing a file is a special case of creating a compressed delta between two files. This is useful because you can use the same format regardless of whether the destination has a version of your file.
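As a rough illustration of why the decoding side is the easy part, here is a sketch of applying a delta made of copy and add instructions. This is not the VCDIFF wire format from RFC 3284 (which also has a RUN instruction, lets COPY read from already-decoded target data, and uses a compact binary encoding); `DeltaOp` and `apply_delta` are hypothetical names used only for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// One instruction of a simplified delta (hypothetical structure).
struct DeltaOp {
    enum class Kind { Copy, Add };
    Kind kind;
    std::size_t offset;  // Copy: start offset in the source file
    std::size_t length;  // Copy: number of bytes to copy
    std::string data;    // Add: literal bytes to append
};

// Rebuilds the target by replaying the instructions in order.
std::string apply_delta(const std::string& source,
                        const std::vector<DeltaOp>& ops) {
    std::string target;
    for (const DeltaOp& op : ops) {
        if (op.kind == DeltaOp::Kind::Copy) {
            // Reject copies that reach outside the source.
            if (op.offset > source.size() ||
                op.length > source.size() - op.offset) {
                throw std::out_of_range("copy outside source");
            }
            target.append(source, op.offset, op.length);
        } else {
            target += op.data;
        }
    }
    return target;
}
```

A delta that contains only `Add` instructions needs no source at all, which corresponds to the sourceless case described above. The hard part, finding good `Copy` instructions (including moved blocks), lives entirely in the encoder.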

I would start by trying an existing implementation of diff (http://en.wikipedia.org/wiki/Diff).

Suggestion: use a hash function and a divide-and-conquer approach to narrow down the changed block(s). Hashing is not strictly collision-proof, but SHA-2 should, in my opinion, work for you.
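A minimal sketch of that divide-and-conquer idea follows. It works on two in-memory buffers for simplicity; over a network each comparison would be an exchange of hashes. FNV-1a stands in for SHA-2 here because the C++ standard library has no SHA-2, and `range_hash`, `narrow` and `find_changed_ranges` are illustrative names. Like any offset-based comparison, it only handles in-place changes well, not insertions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // [begin, end)

// FNV-1a over data[begin, end). Stand-in for SHA-2 in this sketch.
std::uint64_t range_hash(const std::string& data,
                         std::size_t begin, std::size_t end) {
    std::uint64_t h = 14695981039346656037ULL;
    for (std::size_t i = begin; i < end; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 1099511628211ULL;
    }
    return h;
}

// If the two ranges hash differently, split in half and recurse
// until the differing range is at most min_len bytes long.
void narrow(const std::string& a, const std::string& b,
            std::size_t begin, std::size_t end,
            std::size_t min_len, std::vector<Range>& out) {
    if (begin >= end) return;
    if (range_hash(a, begin, end) == range_hash(b, begin, end)) return;
    if (end - begin <= min_len) {
        out.emplace_back(begin, end);
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    narrow(a, b, begin, mid, min_len, out);
    narrow(a, b, mid, end, min_len, out);
}

// Returns the byte ranges that differ between a and b. Any length
// difference is reported as one trailing range.
std::vector<Range> find_changed_ranges(const std::string& a,
                                       const std::string& b,
                                       std::size_t min_len) {
    assert(min_len > 0);
    std::vector<Range> out;
    std::size_t common = std::min(a.size(), b.size());
    narrow(a, b, 0, common, min_len, out);
    if (a.size() != b.size()) {
        out.emplace_back(common, std::max(a.size(), b.size()));
    }
    return out;
}
```

The trade-off compared with a flat list of block checksums is fewer hashes sent when changes are rare, at the cost of one round trip per level of recursion.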

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow