Question

I am taking the following approach to computing similarity between multiple text file documents:

  1. For each document in a given directory, break the document into chunks (sequences of bytes produced by the Basic Sliding Window algorithm).
  2. Calculate a fingerprint for each chunk, i.e. hash the chunk with a hash algorithm such as MD5 (a minimal fingerprinting sketch follows this list).
  3. Compare chunks occurring in different files.
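
For step 2, the fingerprint can also be computed with the JDK's own MessageDigest; the helper below is only an illustrative sketch (my actual code uses a small MD5HashFunction wrapper, shown further down). Encoding the digest as a hex string makes fingerprints comparable by value, whereas raw byte[] arrays only compare by identity:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class ChunkFingerprint {

    // Hash a chunk with MD5 and return the digest as a lowercase hex string,
    // so fingerprints can be compared by value (byte[] only compares by identity).
    public static String md5Hex(String chunk) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(chunk.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}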

So far, I have implemented the first two steps. I am using Google Guava's Multimap to associate hashed chunks with file names. For step 3, I am now looking to do the following:

For each file {
   get the file's hashed chunks
   compare the hashed chunks with those of every other file
}

The idea is to compare file signatures for common entries and report only clusters of files whose signature intersection is above a given similarity threshold. Ultimately, though, I want to tweak this logic so that it yields a similarity score for each file when compared against a source file.
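
Here is a rough sketch of that loop. It is only an outline: THRESHOLD and the similarity() stub are placeholders (filling in similarity() is exactly what this question is about), the methods are assumed to live inside the class shown below alongside the hashMap field, the Multimap is assumed to hold hex-encoded fingerprints (Multimap<String, String>) so that set operations compare by value, and the usual java.util imports are assumed:

    // Placeholder threshold for reporting a pair of files as similar.
    private static final double THRESHOLD = 0.8;

    private void compareAllFiles(ArrayList<File> files) {
        for (int i = 0; i < files.size(); i++) {
            for (int j = i + 1; j < files.size(); j++) {
                // Assumption: hashMap is Multimap<String, String> of hex fingerprints.
                Set<String> a = new HashSet<String>(hashMap.get(files.get(i).getAbsolutePath()));
                Set<String> b = new HashSet<String>(hashMap.get(files.get(j).getAbsolutePath()));

                double score = similarity(a, b);
                if (score >= THRESHOLD) {
                    System.out.println(files.get(i).getName() + " <-> "
                            + files.get(j).getName() + " : " + score);
                }
            }
        }
    }

    // Stub: computing this score is what I am asking about.
    private double similarity(Set<String> a, Set<String> b) {
        return 0.0;
    }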

My question: What is the best way to compare the hashed chunks in step 3, and how do I turn that comparison into a similarity score? Here is my code so far:

import java.io.File;
import java.util.ArrayList;
import java.util.Collection;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.filefilter.TrueFileFilter;

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.Multimap;

public class ContentBasedChunkingMain implements Similarity {

    // Maps a file's absolute path to the MD5 fingerprints of its chunks.
    Multimap<String, byte[]> hashMap = ArrayListMultimap.create();

    public ContentBasedChunkingMain() {
    }

    @Override
    public void getSimilarity(String dir) {

        // Document directory
        File directory = new File(dir);

        // Recursively collect all files in the directory
        Collection<File> collection = FileUtils.listFiles(directory,
                TrueFileFilter.INSTANCE, TrueFileFilter.INSTANCE);

        ArrayList<File> files = new ArrayList<File>(collection);

        // For each file in the directory
        for (File f : files) {

            // Step 1: get the file's chunks
            ArrayList<String> chunks = Chunker.getChunks(f);

            // Step 2: MD5-hash each chunk
            MD5HashFunction h = new MD5HashFunction();
            for (String chunk : chunks) {

                byte[] fingerprint = h.hash(chunk);
                System.out.println(h.byteToHex(fingerprint));

                // Store the file name and hashed chunk in the Multimap
                hashMap.put(f.getAbsolutePath(), fingerprint);
            }
        }

        // Step 3
        for (File f : files) {

            // For each file, get the file's hashed chunks
            Collection<byte[]> bytes = hashMap.get(f.getAbsolutePath());

            // compare ...
        }
    }
}

The content-based chunking algorithm that I am following is described in more detail in the following papers:

http://www.hpl.hp.com/techreports/2009/HPL-2009-90.pdf

http://webglimpse.net/pubs/TR93-33.pdf

http://www.hpl.hp.com/techreports/2005/HPL-2005-42R1.pdf


Solution

For the comparison, I took the intersection of the two collections of hashed chunks.
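
A minimal sketch of that comparison, assuming the fingerprints are stored as hex strings (with raw byte[] values the intersection comes out empty, because arrays are compared by identity) and using the Jaccard index as one possible way to turn the intersection into a score between 0 and 1:

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

import com.google.common.collect.Sets;

public final class ChunkSimilarity {

    // Jaccard index: |A intersect B| / |A union B| over the two files' fingerprint sets.
    public static double jaccard(Collection<String> first, Collection<String> second) {
        Set<String> a = new HashSet<String>(first);
        Set<String> b = new HashSet<String>(second);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty files are trivially identical
        }
        int common = Sets.intersection(a, b).size();
        int total = Sets.union(a, b).size();
        return (double) common / total;
    }
}

File pairs whose score exceeds the chosen threshold can then be grouped into the clusters described in the question.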
