Question

I have to compute SHA256 hashes of 2^25 random strings and then look for a collision (relying on the birthday paradox, and using only the last, say, 50 bits of each hash).

I am storing the string:hash pairs in a dict. I then sort the dict by value (not key) and look for a collision with an O(n) loop.

The problem is that with 2^25 strings and their 2^25 hashes, the dict ends up holding 2^25 key:value pairs, which is EXTREMELY resource intensive. So, how do I do this with limited RAM, say around 1 GB?

What I have already tried:
1. Running this with 6 GB of swap space. The program ran overnight and still was not done. This is effectively even slower than an O(n^2) search! Computing the hashes uses about 3.2 GB of RAM, but when the code reaches the sort, RAM usage starts shooting up again! I thought Python's sort was an in-place quicksort :(
2. Storing only the hashes. I found a collision, but could not recover the corresponding strings since I did not store them.

I am not supposed to use databases etc.; at most a text file, but that doesn't help. Also, I am pretty new to Python, but don't let that limit the level of your answer.

PS: This is an assignment. Some claim to have found the collisions in under one minute with 300 MB of RAM; I don't know if that is true. I have solved the problem, but the answer is out of reach with my resources! I am at work and do not have access to the code right now; I will add it soon.

Code:

from Crypto.Hash import SHA256
from operator import itemgetter

def shaa():

    clist = {}
    # xrange avoids building a 2**25-element list up front (Python 2).
    for i in xrange(33554432):
        sha = SHA256.new(str(i)).hexdigest()
        sha = int(sha, 16) & ((1 << 50) - 1)   # keep only the last 50 bits
        clist[i] = sha

    print 'Hashes done.'

    # Sort the (string, truncated hash) pairs by hash value so that
    # colliding hashes end up adjacent.
    clist = sorted(clist.items(), key=itemgetter(1))
    for i in xrange(len(clist) - 1):
        # Compare the hash part of the tuples, not the whole (key, value) pair.
        if clist[i][1] == clist[i + 1][1]:
            print clist[i], clist[i + 1]
            return 1
    return 2

result = 2
while result == 2:
    result = shaa()

Solution

I'd go for something like this:

Open 16 files (binary mode should be fine; this will be easiest if all your strings have the same length). Generate your strings and hashes, and write each one to a file chosen by the first 4 bits of its hash. Then load and process each file separately. This reduces memory usage by a factor of 16. (Of course you can use any number of files, as long as you don't run out of file handles; having to open and close each file on every access would become rather slow.)

If generating the strings and hashes is relatively inexpensive, you don't even need the files. Simply do 16 passes, and in each pass keep only those hashes whose upper nibble matches the pass number.
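For illustration, here is a minimal sketch of the no-files variant in Python 3 (not the original poster's code). It assumes, as in the question's code, that the strings are simply str(i) and that only the last 50 bits of the digest matter; the function names are mine:

import hashlib

N = 2 ** 25
MASK50 = (1 << 50) - 1                       # keep only the last 50 bits

def hash50(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16) & MASK50

def find_collision():
    for nibble in range(16):                 # one pass per upper nibble
        seen = {}                            # truncated hash -> string, ~1/16 of the data
        for i in range(N):
            s = str(i)
            h = hash50(s)
            if h >> 46 != nibble:            # top 4 bits of the 50-bit value
                continue
            if h in seen:
                return seen[h], s            # two strings with the same 50-bit hash
            seen[h] = s
    return None

print(find_collision())

Each pass keeps only roughly 2**21 entries in memory, at the price of recomputing all 2**25 hashes 16 times.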

OTHER TIPS

One way to solve the problem is to use a very long bitfield, so that every hash is mapped to a certain position in a 2^25-bit block of memory.

A better, though not 100%-certain, way to solve this kind of problem is a Bloom filter or another probabilistic data structure.

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set".

Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries.

A Bloom filter with 1% error requires only about 9.6 bits per element — regardless of the size of the elements.

So, at 9.6 bits per element, 2^25 elements need just 38.4 MiB of memory.
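For illustration, a minimal Bloom filter sketch in Python 3 (an independent example, not code from any of the answers; the class name and parameter choices are assumptions):

import hashlib

class BloomFilter:
    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k                                 # number of hash functions
        self.bits = bytearray(m_bits // 8 + 1)     # ~40 MiB for m = 10 * 2**25

    def _positions(self, item):
        # Derive k bit positions from slices of one SHA-256 digest
        # (works for k <= 8, since the digest has 32 bytes).
        digest = hashlib.sha256(item).digest()
        for j in range(self.k):
            chunk = digest[4 * j:4 * j + 4]
            yield int.from_bytes(chunk, 'big') % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

To hunt for collisions, feed each 50-bit truncated hash (say, h.to_bytes(7, 'big')) through the filter: anything the filter claims to have seen already is a candidate, which must then be confirmed in a second pass, since false positives are possible.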

I think the key insight here -- which I admit evaded me for some time, till I came back to this a few hours later -- is that the sha256 hash digest is its own hash. In other words, you don't have to do any additional hashing or set creation. All you have to do is create a custom hash table, using the sha256 digest as the hash. To save space, don't store the strings; just create a bit-array (using shift operations on integers in an array of ints created with table = numpy.zeros(table_size / bits_per_int + 1, dtype='i')) to detect collisions, and then save colliding strings in a dict mapping hashes to strings for lookup in a second pass.

table_size should be a large prime -- I picked one slightly larger than 2 ** 31, which made for a 268MB table -- because that will produce few new collisions/false positives (introduced by the modulo operation on the hash). You can save the strings themselves to a text file, which you can iterate over.

So for any string, the index of the corresponding bit to set would be index = int(hashlib.sha256('foo').hexdigest(), base=16) % table_size. Then calculate the major_index = index / bits_in_int and minor_index = index % bits_in_int, use shift and bitwise operations on minor_index to check and store the correct bit in the int at table[major_index], and so on.
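Concretely, the bit-table bookkeeping might look something like this (a rough Python 3 sketch only; the names follow the prose above, and the particular prime and helper name are my own assumptions, not the poster's code):

import hashlib
import numpy as np

table_size = 2147483659          # a prime slightly above 2**31 (assumed choice; pick your own)
bits_in_int = 32
table = np.zeros(table_size // bits_in_int + 1, dtype='uint32')   # ~268 MB

def check_and_set(data):
    """Return True if the bit for this bytestring was already set (a candidate collision)."""
    index = int(hashlib.sha256(data).hexdigest(), 16) % table_size
    major_index, minor_index = divmod(index, bits_in_int)
    mask = np.uint32(1 << minor_index)
    already_set = bool(table[major_index] & mask)
    table[major_index] |= mask
    return already_set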

Now do a pass through the strings. Whenever a string generates a hash that maps to a bit that has already been set, store a hash:string pair in a dictionary. Or better yet, store a hash:[string_list] pair, appending new strings to the list in the case of multiple collisions. Now for any colliding pair of strings (a, b), the dictionary will contain the hash and value of b.

Then do a second pass through the strings, hashing each in turn and checking the dictionary for each hash. If the hash is in the dictionary, and the string is not already in the corresponding list, add the string to the list.

Some of the strings in the dictionary will not correspond to true collisions; the [string_list] for most of those hashes will be only one item long, and those hash:[string_list] pairs may be discarded. The others are probably true collisions resulting from the hash function itself, rather than from the modulo operation. However, you may still have a few false positives to weed out, in those cases where there was both a true and a false positive; you'll have to double-check the resulting lists for false positives.

BasicWolf's suggestion of using a Bloom filter is a good one, and might result in a smaller table. But it adds a lot of complication; I didn't bother. I tried the above method on newline-terminated strings from '0\n' to '33554431\n' and found two hashes with a 54-bit overlap. It took 11 minutes, and the maximum memory usage was about 350 MB (though that could probably be reduced). I did some profiling and found that most of the time was spent calculating offsets for the bit-table. Coding this in C would probably provide a significant speed-up, and pre-hashing and storing the hashes as well as the strings would help too.

In fact, I tried pre-hashing the strings, and replaced my rather ad-hoc numpy-based bit array with a bitarray from the C-based extension module of the same name. This reduced the runtime to just over 2 minutes, while keeping the memory profile around 350 MB.

Close enough for government work, I think. Since this is an assignment, I won't post the code, but I'm happy to provide additional pointers.

Why don't you use a dictionary from last-50-bits-of-hash to string?
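For illustration, that could be as small as the following Python 3 sketch (again assuming the strings are simply str(i), as in the question's code):

import hashlib

seen = {}                                # last 50 bits of the hash -> string
for i in range(2 ** 25):
    s = str(i)
    h = int(hashlib.sha256(s.encode()).hexdigest(), 16) & ((1 << 50) - 1)
    if h in seen:
        print('collision:', seen[h], s, hex(h))
        break
    seen[h] = s

Be aware, though, that a CPython dict with 2**25 entries may well exceed the 1 GB budget from the question, which is exactly the problem the other answers work around.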

Split the hash, for example, into groups of 10 characters and nest the values (for example, in a dictionary of dictionaries). That way you get a recursive lookup, but it should be faster.
