What hashing should I use to generate random values from a set of strings?

https://stackoverflow.com//questions/9606011

09-12-2019

Question

I have an array of fingerprints in hash buckets. I would like to insert into a bucket and search it without scanning from entry 0 to entry n.

What I want to do is this: when I add an entry, I use the fingerprint as the input to a hash that determines which bucket it goes into. That was not difficult, but when I hash the fingerprint with the same algorithm to pick the slot within the bucket, I get a lot of collisions.

Here is the code I use to hash the fingerprints into the buckets. I tried the same code with more characters, but it still gives me a high number of collisions.

he.fing_print is 33 characters wide

number of buckets is 1024

number of entries per bucket is 2048

    char hph[32];
    int bk,en;
    unsigned long h = 0, g,i=0;
    int j=0;
    strncpy(hph,(const char*)(he).fing_print,32);

    while ( j<32 )
    {
        h = h + hph[j]++;
        g = h & 0xFFf00000;
        h ^= g >> 24;
        h &= ~g;
        j++;
    }
    bk=h%buckets;
    en=h%entries_per_bk;

Solution

There are some superfluous things in your hashing function.

    char hph[32];
    int bk,en;
    unsigned long h = 0, g,i=0;
    int j=0;
    strncpy(hph,(const char*)(he).fing_print,32);

    while ( j<32 )
    {
        h = h + hph[j]++;

This is effectively `h += hph[j];`. The character at index j is incremented afterwards, but since it is never used again, that doesn't influence the hash at all. Perhaps you meant to pre-increment it? That wouldn't change much either.

        g = h & 0xFFf00000;

The fingerprint (or at least the part of it you use) is at most 32 characters long. Each of those characters is less than 256, so the running sum always stays below 32*256 = 8192 = 0x2000. The mask 0xFFF00000 only covers bits 20 and up, hence h & 0xFFF00000 is 0 on every iteration. Thus the following two lines do exactly nothing to h.

        h ^= g >> 24;
        h &= ~g;
        j++;
    }
    bk=h%buckets;
    en=h%entries_per_bk;
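The two observations above can be checked with a small sketch. It assumes plain ASCII input (character codes below 128) of at most 32 characters, and the function names `original_hash` and `plain_sum` are made up for this illustration. Under those assumptions the loop from the question should return the same value as a bare sum of the character codes.

```c
#include <stddef.h>

/* The loop from the question, reading each character into a local copy
   so the input can stay const. Assumes ASCII input (codes below 128). */
unsigned long original_hash(const char *fp, size_t len)
{
    unsigned long h = 0, g;
    for (size_t j = 0; j < len; j++) {
        char c = fp[j];
        h = h + c++;           /* post-increment: the old value of c is added */
        g = h & 0xFFF00000;    /* stays 0 while h is below 0x100000 */
        h ^= g >> 24;          /* no effect when g == 0 */
        h &= ~g;               /* no effect when g == 0 */
    }
    return h;
}

/* What the loop reduces to: a plain sum of the character codes. */
unsigned long plain_sum(const char *fp, size_t len)
{
    unsigned long h = 0;
    for (size_t j = 0; j < len; j++)
        h += (unsigned char)fp[j];
    return h;
}
```

For any 32-character ASCII fingerprint `fp`, `original_hash(fp, 32)` and `plain_sum(fp, 32)` agree. Swapping two characters of `fp` leaves both results unchanged, because addition does not depend on order.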

So effectively, your hash is the sum of the first 32 characters of the fingerprint. That doesn't spread your hashes well: similar strings generate similar hashes, and the result can never exceed 8192. You would obtain a better hash by multiplying the hash so far by a largish prime,

    h = 0;
    for(j = 0; j < 32; ++j)
        h = prime*h + hph[j];

so that small differences at any index (except the last, but you could multiply once more to spread those too) can create large differences in the hash.
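A minimal, self-contained sketch of that idea follows. The prime 1000003, the names `prime_hash` and `locate`, and the way the slot index is derived are all assumptions made for this illustration; the answer itself only says to pick a largish prime.

```c
#include <stddef.h>

/* Assumed multiplier: 1000003 is prime; other largish odd primes should work too. */
#define HASH_PRIME 1000003UL

/* Multiply-and-add hash over the first len characters of s.
   Unsigned overflow wraps around, which is harmless for hashing. */
unsigned long prime_hash(const char *s, size_t len)
{
    unsigned long h = 0;
    for (size_t j = 0; j < len; j++)
        h = HASH_PRIME * h + (unsigned char)s[j];
    return h;
}

/* Split one hash value into a bucket index and a slot index.
   The slot uses the quotient h / buckets (a choice of this sketch),
   so it is not computed from the same low bits as the bucket. */
void locate(unsigned long h, unsigned long buckets, unsigned long entries_per_bk,
            unsigned long *bk, unsigned long *en)
{
    *bk = h % buckets;
    *en = (h / buckets) % entries_per_bk;
}
```

With the sizes from the question, a call would look like `locate(prime_hash(hph, 32), 1024, 2048, &bk, &en);`. Taking the slot from `h / buckets` instead of `h % entries_per_bk` is not something the answer prescribes. The reason for it here is that 1024 divides 2048, so `h % 2048` and `h % 1024` always share their low 10 bits, and a given bucket could then only ever use two of its 2048 slots.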

Licensed under: CC-BY-SA with attribution
