My goal is to efficiently perform an exhaustive diff of a directory tree. Given a set of files F, I must compute the equivalence classes EQV = [F₁, F₂, F₃, ..., Fₙ] such that fⱼ, fₖ ∈ EQV[i] iff fⱼ is identical to fₖ, for all i, j, k (i.e., each class groups exactly the files whose contents are byte-for-byte identical).
My general approach is to start with just one big class containing all the initial files, EQV₀ = [[f₁, f₂, ..., fₙ]], and to repeatedly split it into successively finer partitions EQV₁, EQV₂, ..., EQVₘ, one per heuristic, where the m heuristics might be file size, checksum function 1, checksum function 2, and so on. After all m heuristics have been applied (giving EQVₘ), a pairwise diff of all the files within each class of EQVₘ must be made. Because this last step is quadratic in the size of each class, i.e. O(sum(n² for n in map(len, EQVₘ))), it will probably be the bottleneck of my algorithm even if each of the m splits is done in linear time, so my goal is to make EQVₘ as flat as possible.
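To make the splitting step concrete, here is a minimal sketch of how I picture the refinement (the refine helper and the file paths are hypothetical; each heuristic is plugged in as a key function):

    import os
    from collections import defaultdict

    def refine(classes, key):
        """Split each class into sub-classes whose members agree on key(f)."""
        refined = []
        for cls in classes:
            if len(cls) < 2:              # singletons cannot split any further
                refined.append(cls)
                continue
            buckets = defaultdict(list)
            for f in cls:
                buckets[key(f)].append(f)
            refined.extend(buckets.values())
        return refined

    # EQV₀ holds every file in one class; each heuristic refines the partition.
    files = ["a.txt", "b.txt", "c.txt"]   # hypothetical paths
    eqv = [files]
    eqv = refine(eqv, os.path.getsize)    # heuristic 1: file size
    # eqv = refine(eqv, checksum_1)       # heuristics 2..m: checksums, etc.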
I would like to have access to a variety of good hash functions that I can apply to minimize collisions in EQVₘ. My current idea is to take some library-provided checksum function, such as Adler-32, and generate variations of it by simply applying it starting at different byte offsets within the file. Another idea is to first apply fast hash functions, such as Adler-32, and then more expensive ones, such as MD5, only on the classes that are still too large; a sketch of both ideas follows.
Considering that I can compute all the hashes for a given file in just one read of that file, how could I compute a variety of hashes that will help me discriminate among non-identical files?
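What I picture is something like the following sketch, where every chunk is fed to several hashers so the file is read exactly once (the particular mix of CRC-32, Adler-32, and MD5 is arbitrary):

    import hashlib
    import zlib

    def multi_hash(path, chunk=1 << 20):
        """Compute several checksums in a single pass over the file."""
        crc, adler = 0, 1             # zlib's documented starting values
        md5 = hashlib.md5()
        with open(path, "rb") as fh:
            for block in iter(lambda: fh.read(chunk), b""):
                crc = zlib.crc32(block, crc)
                adler = zlib.adler32(block, adler)
                md5.update(block)
        return crc, adler, md5.digest()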
Alternatively, what is a good list of hash functions available in Python that aren't cryptographically secure?
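For reference, the non-cryptographic options I already know of (the third-party packages are assumptions on my part and would need to be installed separately):

    import zlib

    zlib.crc32(b"data")     # CRC-32, standard library
    zlib.adler32(b"data")   # Adler-32, standard library

    # Third-party packages (assumed available; install separately):
    # import mmh3;   mmh3.hash(b"data")      # MurmurHash3
    # import xxhash; xxhash.xxh64(b"data")   # xxHash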
Edit:
Another idea would be to use a "Rabin fingerprint" based on a fixed set of randomly generated inputs. Would this make sense for this purpose?
http://en.wikipedia.org/wiki/Rabin_fingerprint
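A minimal sketch of what I mean, though simplified: instead of reducing modulo a random irreducible polynomial over GF(2) as in the real Rabin scheme, this evaluates the file's bytes as a polynomial at a randomly drawn point modulo a fixed prime, with each seed yielding one member of the family:

    import random

    PRIME = (1 << 61) - 1  # a Mersenne prime, chosen arbitrarily

    def make_fingerprint(seed):
        """One fingerprint per randomly drawn evaluation point."""
        point = random.Random(seed).randrange(2, PRIME - 1)
        def key(path, chunk=1 << 20):
            acc = 0
            with open(path, "rb") as fh:
                for block in iter(lambda: fh.read(chunk), b""):
                    for byte in block:   # bytes iterate as ints in Python 3
                        acc = (acc * point + byte) % PRIME
            return acc
        return key

    # A fixed set of seeds gives a reusable family of fingerprints.
    family = [make_fingerprint(s) for s in (11, 22, 33)]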