Question

Assume I have 100 files, of which 10 are the originals and the remaining 90 are revisions and/or combinations of the originals (imagine a Git tree). Given any descendant file, how can I identify the file that is most likely its original source? Does the process change if there are 1 million documents? 100 million documents?

(This isn't homework -- just a non-programmer who can't find the right approach via Google ;) )

Solution

This sounds like the perfect place to apply phylogenetic tree-building methods! These have famously been used to reconstruct the copying history of Biblical manuscripts.

I would recommend starting with a method like Neighbour Joining, which is pretty fast (cubic -- and yes, that is considered fast among this class of methods) and pretty accurate. All it needs is a distance matrix: for n files, this is an n×n table of numbers, each giving the distance between a pair of files. Distances can be computed any way you want, but some ways make more sense than others. `diff file1 file2 | wc -c` would be one crude way to calculate the distance between two files.
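
If shelling out to diff feels clumsy, here is a minimal sketch of building that distance matrix in Python, using the standard library's difflib as a stand-in for `diff | wc -c`; the file names are purely illustrative.

```python
# A crude pairwise distance matrix over a set of text files.
import difflib
from itertools import combinations

def file_distance(path_a, path_b):
    """Crude distance: 1 minus the similarity ratio of the two files' text."""
    with open(path_a, encoding="utf-8", errors="replace") as f:
        a = f.read()
    with open(path_b, encoding="utf-8", errors="replace") as f:
        b = f.read()
    # SequenceMatcher.ratio() is 1.0 for identical inputs, so this
    # distance is 0 for identical files and grows toward 1.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()

def distance_matrix(paths):
    n = len(paths)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = file_distance(paths[i], paths[j])
        dist[i][j] = dist[j][i] = d
    return dist

files = ["doc00.txt", "doc01.txt", "doc02.txt"]  # hypothetical paths
matrix = distance_matrix(files)
```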

One caveat: many phylogenetic methods, including Neighbour Joining, build unrooted trees -- that is, they cannot infer an ancestor, only the tree structure relating the different items (which are called taxa when they are biological species or individuals). Still, that should be enough to help you find the root. More sophisticated Maximum Likelihood models can sometimes infer rooted trees, but these tend to be geared towards the specifics of how DNA evolves over time.
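
For concreteness, here is a sketch of feeding a distance matrix like the one above into Neighbour Joining, assuming Biopython is installed (`pip install biopython`). Biopython's DistanceMatrix takes the lower triangle of the matrix, including the zero diagonal; the labels and values here are hypothetical.

```python
# Build an unrooted Neighbour Joining tree from a precomputed distance matrix.
from Bio.Phylo import draw_ascii
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

names = ["doc00", "doc01", "doc02"]      # hypothetical file labels
lower_triangle = [
    [0.0],
    [0.2, 0.0],
    [0.7, 0.6, 0.0],
]
dm = DistanceMatrix(names, lower_triangle)
tree = DistanceTreeConstructor().nj(dm)  # Neighbour Joining
draw_ascii(tree)                         # quick text rendering of the tree
```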

OTHER TIPS

Two major building blocks:

Distance between Files

One approach is to treat the two files as two (big) strings and compute the edit distance between them; see http://en.wikipedia.org/wiki/File_comparison. For quickly determining that two files are different, you can compare md5sum hashes: differing hashes prove the files differ, while identical hashes mean they are almost certainly byte-for-byte identical.
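
As a quick illustration of the hashing idea, here is a sketch that groups byte-for-byte identical files by their MD5 digest; the paths are illustrative.

```python
# Group exact-duplicate files by MD5 digest.
import hashlib
from collections import defaultdict

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

groups = defaultdict(list)
for path in ["doc00.txt", "doc01.txt", "doc02.txt"]:  # hypothetical paths
    groups[md5_of(path)].append(path)

# Files sharing a digest are (almost certainly) identical copies.
exact_duplicates = [paths for paths in groups.values() if len(paths) > 1]
```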

Cluster similar files

Then, similar documents, i.e. documents with low edit distances between them, may be clustered. See http://en.wikipedia.org/wiki/Document_clustering and http://en.wikipedia.org/wiki/K-means_algorithm etc.
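
One wrinkle: k-means needs feature vectors rather than pairwise distances, so the sketch below uses SciPy's agglomerative (hierarchical) clustering instead, which works directly from a precomputed distance matrix. The matrix values and the 0.5 cut-off threshold are illustrative.

```python
# Cluster files from a precomputed pairwise distance matrix.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

dist = np.array([
    [0.0, 0.2, 0.7],
    [0.2, 0.0, 0.6],
    [0.7, 0.6, 0.0],
])
condensed = squareform(dist)              # condensed form expected by linkage
Z = linkage(condensed, method="average")  # average-linkage clustering
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)                             # e.g. [1 1 2]: first two files cluster
```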

Note

This does not directly recognize ancestry relationships between files. Without additional data, I believe it is impossible to derive that reliably, since a newer revision may be arbitrarily larger than, smaller than, or different from a previous version. We are banking on the expectation that different revisions of the same file will have comparatively small edit distances and will therefore cluster together.
