Question

unsigned int HashString( const char *string ) {
    const char* p;
    unsigned hash = 40503;

    for ( p = string; *p != '\0'; ++p ) {
        hash += *p;
        hash += ( hash << 10 );
        hash ^= ( hash >> 6 );
    }
    hash += ( hash << 3 );
    hash ^= ( hash >> 11 );
    hash += ( hash << 15 );

    return hash;
}

Just wandering over their code. I've never seen a hashing function like this before though.

I'm not too seasoned when it comes to bitwise ops, I know how bit shifting and masking works, but only in a rudimentary scenario like checking if bits are set.

What exactly does this do?

Was it helpful?

Solution

Read here for a general overview, and go down for "One-at-a-Time hash" (by Jenkins), which coincides with this one.

Also see this Wikipedia entry, mentioned in this answer.

"how exactly is this a good hash?" Not exactly. Those shifts are a little arbitrary, justified mostly from some heuristics and empirical tests.

OTHER TIPS

This sort of thing will be much easier to understand when you've got a broader understanding of binary arithmetic in general. It's way easier to go from the math to the code than the other way around.

I'm not having much luck finding a good online resource, but I was very happy with an earlier edition of this textbook when I was in school. You may be able to find some online lecture notes from a good CS class on binary arithmetic as well.

This site may give you a lead into hash theory in general. I wish I could recommend a textbook there, but I've yet to come across a really lucid number theory textbook.

Who says it hashes well?

A hash function maps an input, which in this case is a string, to an output, in this case an unsigned int. The size of the input is (number of usable characters) ^ number of characters in the string where ^ is "raised to the power of".

If your input string can contain only the characters 0 and 1 then the size of the input would be 2^ number of characters in the string

The size of the output is fixed, at the largest number representable in an unsigned int.

This means that there is a "number of characters in the string" where the size of the input will be greater than the size of the output. By the pigeon hole principle you will definitely start having collisions. In reality, you probably had collisions before this threshold is reached.

If you want to use a hash function in your hash_map or any other data structure, make sure it is tuned to your specific input. Don't go picking up the first one you find in the internet. A good hash function provides as few collisions as possible for your specific inputs.

A general purpose hash function may be sub-optimal in your specific case. A hash function specifically designed for some inputs (and this may very well be such a function) may work significantly worse on your inputs.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top