What is the best hash function for Rabin-Karp algorithm?

https://stackoverflow.com/questions/11546791

21-06-2021
|

Question

I'm looking for an efficient hash function for Rabin-Karp algorithm. Here is my actual code (C programming language).

static bool f2(char const *const s1, size_t const n1, 
               char const *const s2, size_t const n2)
{
    uintmax_t hsub = hash(s2, n2);
    uintmax_t hs   = hash(s1, n1);
    size_t   nmax = n2 - n1;

    for (size_t i = 0; i < nmax; ++i) {
        if (hs == hsub) {
            if (strncmp(&s1[i], s2, i + n2 - 1) == 0)
                return true;
        }
        hs = hash(&s1[i + 1], i + n2);
    }
    return false;
}

I considered some Rabin-Karp C implementations, but there are differences between all the codes. So my question is: what are the characteristics that a Rabin-Karp hash function should have?

Solution

A extremly good performing hash is the bernstein hash. It even outruns many popular hashing algorithms.

unsigned bernstein_hash ( void *key, int len )
{
    unsigned char *p = key;
    unsigned h = 0;
    int i;

    for ( i = 0; i < len; i++ )
        h = 33 * h + p[i];

    return h;
}

Of course, you can try out other hashing algorithms, as described here: Hash function on NIST

Note: It has never been explained why the 33 is performing so much better than any other "more logic" constant.

For your interest: Here is a good comparison of different hash algorithms: strchr comparison of hash algorithms

OTHER TIPS

For the problem with the small alphabets, such as nucleic acid sequence searching (e.g. alphabet = {A, T, C, G, U}), nt-Hash may be a good hash function. It uses the binary operation, which is faster, and rolling hash update, and it also gives uniform distributed hash values.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow