Question

Here's my problem (I'm programming in C):

I have some huge text files containing DNA sequences (each file has something like 65 million rows and is about 4–5 GB). These files contain many duplicates (I don't know how many yet, but there should be many millions) and I want to output a file with only distinct values. Each string has a quality value associated with it, so if, e.g., I have 5 equal strings with different quality values, I'll keep the best one and discard the other 4.

Reducing memory requirements and improving speed as far as I can is VITAL. My idea was to create a JudyHS array, using a hash function to convert the DNA sequence string (which is 76 letters long over 7 possible characters) into an integer in order to reduce memory usage (4 or 8 bytes instead of 76 bytes over many millions of entries would be quite an improvement). This way I could use the integer as an index and store only the best quality value for that index. The problem is that I can't find a hash function that UNIVOCALLY defines such a long string and produces a value that fits inside an int or even a long long!

My first idea for a hash function was something like the default string hash function in Java: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], but that can reach a maximum value of about 8.52×10^59... way too big. What about doing the same thing and storing the result in a double? Would the computation become a lot slower? Please note that I'd like a way to UNIVOCALLY define a string, avoiding collisions (or at least making them extremely rare, because every collision would force a disk access, which is quite costly...)
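For reference, the Java-style polynomial hash can be computed in C with 64-bit unsigned wraparound, which keeps the value inside an integer (a sketch; the implicit reduction modulo 2^64 is exactly why distinct strings can collide):

```c
#include <stdint.h>

/* Java-style polynomial hash, computed modulo 2^64 via unsigned
 * wraparound. The result always fits a uint64_t, but the reduction
 * means distinct strings can hash to the same value. */
uint64_t poly_hash64(const char *s)
{
    uint64_t h = 0;
    while (*s)
        h = h * 31 + (uint64_t)(unsigned char)*s++;
    return h;
}
```

This is far cheaper than doing the arithmetic in a double, and unsigned overflow in C is well defined, so no special handling is needed.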


Solution

You have 7^76 possible DNA sequences and want to map them to 2^32 hashes without collisions? Not possible.

You need at least log2(7^76) ≈ 214 bits to do that, about 27 bytes.

If you can live with some collisions, I would recommend sticking with CRC32 or MD5 instead of reinventing the wheel.

OTHER TIPS

The "simple" way to get a collision-free hash function for N elements is to use a good mixing function (say, a cryptographic hash function) and truncate the output, so that the hash results live in a space of size at least N^2. Here, you have 65 million rows -- this fits in 26 bits (2^26 is close to 65 million), so 52 bits "ought to be enough".

You can try using a fast cryptographic hash function, even a "broken" one since this is not a security-related problem. MD4, MD5, SHA-1... then truncate the result to the first (or last) 64 bits, store that in a 64-bit integer type. Chances are that you will not get any collision among your 65 million rows; and if you get some, they will be very rare.
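A minimal sketch of the "hash, truncate to 64 bits, store in an integer" pattern. FNV-1a is used here only as a compact stand-in for a real MD4/MD5/SHA-1 call, since it fits in a few lines (it is not cryptographic; with a real digest you would hash the 76-byte sequence and keep the first 8 bytes of the output instead):

```c
#include <stdint.h>
#include <stddef.h>

/* 64-bit FNV-1a: a stand-in for "cryptographic hash truncated to
 * 64 bits". Mixes each byte into a 64-bit state that can be used
 * directly as a JudyHS-style key. */
uint64_t fnv1a64(const char *s, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 0x100000001b3ULL;            /* FNV-1a 64-bit prime */
    }
    return h;
}
```

The resulting `uint64_t` is the index under which only the best quality value needs to be kept.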

For optimized C implementations of hash functions, look up sphlib. Use the provided sph_dec64le() function to "decode" a sequence of 8 bytes into a 64-bit unsigned integer value.
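What that decode step does can be sketched in portable C (the name dec64le below is a hypothetical stand-in; sphlib's own sph_dec64le() should be used in practice):

```c
#include <stdint.h>

/* Portable equivalent of a little-endian 64-bit decode, as done by
 * sphlib's sph_dec64le(): assemble 8 bytes, least-significant byte
 * first, into one 64-bit unsigned integer. */
uint64_t dec64le(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}
```

This is how the first 8 bytes of a truncated digest become the integer key to store.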

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow