Question

I have a programming task from university that requires finding shared strings of a constant length by comparing hundreds of files (good files and bad files, each smaller than one megabyte) byte by byte.

Supposing I go for total coverage and actually compare each file to every other file, is it possible to complete this task within a few minutes?

I've tried the naive algorithm and have been improving on it for several days now, but I can't seem to get under a few hours.

What I did so far:

CPU:

I've locally benchmarked different comparison methods and buffer sizes to see which fits my needs best.

I don't keep the signature itself, only a reference to it (via a boolean array the same size as the file, which also lets me avoid re-comparing an index that has already been ruled out).
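The ruled-out-index bookkeeping described above might look like this minimal sketch (all names here are illustrative, not from the original post):

```java
// Sketch: track which starting offsets in a file are still candidate
// signature positions. Names and structure are illustrative only.
public class CandidateMask {
    private final boolean[] ruledOut;

    public CandidateMask(int fileLength, int signatureLength) {
        // Offsets past fileLength - signatureLength can never start a full signature.
        this.ruledOut = new boolean[Math.max(0, fileLength - signatureLength + 1)];
    }

    public void ruleOut(int offset) {
        ruledOut[offset] = true;
    }

    public boolean isCandidate(int offset) {
        return !ruledOut[offset];
    }

    public int candidateCount() {
        int n = 0;
        for (boolean b : ruledOut) {
            if (!b) n++;
        }
        return n;
    }
}
```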

I'm currently submitting the comparisons as callable tasks, in the hope that this won't create too much overhead or too many synchronization issues.
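One way to fan out pairwise comparisons as Callables is via a fixed thread pool; the pool sizing and the naive pair-comparison body below are my own guesses at the setup, not taken from the post:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CompareTasks {
    // One task per file pair: report whether the two byte arrays share any
    // common window of the given length (naive check, for illustration only).
    static Callable<Boolean> compareTask(byte[] a, byte[] b, int len) {
        return () -> {
            for (int i = 0; i + len <= a.length; i++) {
                for (int j = 0; j + len <= b.length; j++) {
                    if (Arrays.equals(Arrays.copyOfRange(a, i, i + len),
                                      Arrays.copyOfRange(b, j, j + len))) {
                        return true;
                    }
                }
            }
            return false;
        };
    }

    // Run all tasks on a pool sized to the machine, collect results in order.
    static List<Boolean> runAll(List<Callable<Boolean>> tasks) throws Exception {
        ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<Boolean> results = new ArrayList<>();
            for (Future<Boolean> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```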

Virtual Memory:

I'm determining buffer sizes according to the free memory available (Runtime.getRuntime().freeMemory() - about 2GB after manually raising the heap limit) to prevent thrashing, and I've settled on what seems to me a reasonable trade-off in the amount of information kept per file.
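Sizing buffers from the reported free heap could be sketched as below; the reserve fraction and the 1 MiB cap (the stated maximum file size) are arbitrary choices of mine for illustration:

```java
// Sketch: derive a per-file buffer size from the JVM's reported free heap.
public class BufferSizing {
    static int bufferSize(long freeBytes, int fileCount) {
        // Keep half the free heap in reserve, split the rest across files.
        long perFile = (freeBytes / 2) / Math.max(1, fileCount);
        // No file exceeds 1 MiB, so a larger buffer is never useful here.
        return (int) Math.min(perFile, 1 << 20);
    }

    public static void main(String[] args) {
        long free = Runtime.getRuntime().freeMemory();
        System.out.println("buffer per file: " + bufferSize(free, 500) + " bytes");
    }
}
```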

Algorithm:

I try to compare only a subset of the bytes, at suspicious locations identified by a static analysis of the files' structures (JAR files; I didn't go into the bytecode, because I have no idea how to extract relevance from bytecode - I only compare the "classes.dex").


Given that this must be a common task, am I missing something very obvious? I've been told that hashing the signatures can be faster, but I doubt it beats waiting for the comparisons to end and storing the results afterwards via the references (that part is very fast once the comparisons themselves, which are the bottleneck, end). To me hashing looks like a big risk of hogging virtual memory.
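For context on why hashing is usually suggested here: collecting every fixed-length window of each file into a hash set turns pairwise comparison from "every offset against every offset" into one pass per file plus a set intersection. This is my own sketch of that idea, not the asker's code:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class WindowHashing {
    // Collect every length-n window of 'data' as a key. A rolling hash
    // (Rabin-Karp) would avoid re-hashing each window from scratch, but even
    // this direct version is one pass per file rather than one pass per pair.
    static Set<String> windows(byte[] data, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= data.length; i++) {
            // ISO-8859-1 maps each byte to one char, so no bytes are lost.
            out.add(new String(data, i, n, StandardCharsets.ISO_8859_1));
        }
        return out;
    }

    // Windows of length n that appear in both inputs.
    static Set<String> shared(byte[] a, byte[] b, int n) {
        Set<String> s = windows(a, n);
        s.retainAll(windows(b, n));
        return s;
    }
}
```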

I was told that this should run within "reasonable time", and the aim is to find the best (smallest) set of strings that covers most of the bad files and none of the good files (or close to it). It seems to me that I'm way off, having heard from some people who claim to have completed it.

If more information is needed, please ask and I'll edit it into the post.


I am planning on using this implementation of a Trie; in case I forget to update this post, I hope whoever runs into it can use that implementation (or others in this project) for their own needs!

Was it helpful?

Solution

If you want to cover all strings, what you are after is a trie. It is a tree where each node is a byte of one of your strings, and the final node records how many times the string appears.

If you have "Dog", "Dad", "Dod", "Dog", you end with something like

 D
 ├── a
 │   └── d(1)
 └── o
     ├── d(1)
     └── g(2)

Since the strings are of fixed length n, you will have at most 256^i nodes at each level i, so the total is at most 256^0 + 256^1 + ... + 256^n nodes (an upper bound).
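A minimal trie sketch along these lines (using a HashMap per node instead of a 256-entry array so sparse levels stay small; this is an illustration, not the implementation the asker linked):

```java
import java.util.HashMap;
import java.util.Map;

public class ByteTrie {
    // One child per distinct byte; 'count' is nonzero only at the node
    // that ends a full inserted string.
    private final Map<Byte, ByteTrie> children = new HashMap<>();
    private int count;

    public void insert(byte[] s) {
        ByteTrie node = this;
        for (byte b : s) {
            node = node.children.computeIfAbsent(b, k -> new ByteTrie());
        }
        node.count++;
    }

    // How many times 's' was inserted; 0 if it was never seen.
    public int countOf(byte[] s) {
        ByteTrie node = this;
        for (byte b : s) {
            node = node.children.get(b);
            if (node == null) return 0;
        }
        return node.count;
    }
}
```

Inserting "Dog", "Dad", "Dod", "Dog" reproduces the tree from the diagram, with "Dog" ending at a node of count 2.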

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow