Question

If I calculate the probability of collisions using:

def collision(hash_t, items):
    """Probability of at least one collision when hashing
    `items` keys uniformly into a table with `hash_t` slots."""
    prob_no_collision = 1.0
    for i in range(1, items):
        # i slots are already taken, so the next key misses them
        # with probability (hash_t - i) / hash_t
        prob_no_collision *= (hash_t - i) / float(hash_t)
    return 1 - prob_no_collision

Is there an easy way to create a model that will calculate the cost of lookups and insertions in the hash table based on the probability of collisions, so then I could decide the optimum size based on the speed v memory allocation?


Solution

The exact answer depends on your hash function and key type (which determine how the hashing takes place), the size of each hash entry (which in Python differs between 32-bit and 64-bit builds), your collision-handling strategy, and your time/memory requirements. That said, the following is a good rule of thumb:

Use a load factor of 2/3.

That is, make the hash table 3/2 the size of your number of elements. So if you have 1,000 elements, you allocate 1,500 slots. If each slot is 32 bits (an assumption based on a 32-bit Python installation; if this is wrong, someone correct me), then you are wasting almost 2 kB, which is a truly tiny amount of memory. If you had 2,000,000 elements, you would waste almost 4 MB, which is also tiny.
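As a rough sketch of that arithmetic (the function name and the 4-bytes-per-slot figure are illustrative assumptions from the paragraph above, not CPython's real dict entry size):

def table_sizing(num_elements, bytes_per_slot=4):
    """Rough sizing for a 2/3 load factor (table = 3/2 * elements)."""
    table_size = num_elements * 3 // 2      # keeps load factor <= 2/3
    unused_slots = table_size - num_elements
    return table_size, unused_slots * bytes_per_slot

print(table_sizing(1_000))       # (1500, 2000)        -> ~2 kB unused
print(table_sizing(2_000_000))   # (3000000, 4000000)  -> ~4 MB unused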

In short, the common concern with hash tables is rarely space but rather time. CPython's own dict implementation uses a maximum load factor of 2/3 before expanding the table, because many collision-resolution strategies degrade noticeably once the load factor reaches roughly 70% to 80%.
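To connect this back to the question, here is a minimal cost model, assuming open addressing with uniform hashing (the standard textbook approximations: about 1/(1 - a) probes per insertion or unsuccessful lookup and (1/a)·ln(1/(1 - a)) per successful lookup at load factor a; chained tables behave differently):

import math

def expected_probes(num_elements, table_size):
    """Textbook estimates for open addressing with uniform hashing."""
    a = num_elements / table_size            # load factor
    if a >= 1.0:
        raise ValueError("table is full")
    insert = 1.0 / (1.0 - a)                              # insert / miss
    lookup_hit = (1.0 / a) * math.log(1.0 / (1.0 - a)) if a > 0 else 1.0
    return insert, lookup_hit

# Cost grows slowly until the table gets quite full:
for load in (0.5, 2/3, 0.8, 0.9):
    ins, hit = expected_probes(int(load * 1000), 1000)
    print(f"load {load:.2f}: ~{ins:.1f} probes/insert, ~{hit:.1f} probes/hit")

Picking the smallest table whose predicted probe counts you can live with is exactly the speed-versus-memory trade-off the question asks about, and at a 2/3 load factor those counts are still small.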

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow