Translate a hashing algorithm from C to Python

Question 1

The issue here is that C's unsigned int automatically rolls over when it goes past UINT_MAX, while Python's int just keeps getting bigger.

The easiest fix is just to correct at the end:

return hash % (1 << 32)

For very large strings, it maybe a little faster to mask after each operation, to avoid ending up with humongous int values that are slow to work with. But for smaller strings, that will probably be slower, because the cost of calling % 12 times instead of 1 will easily outweigh the cost of dealing with a 48-bit int.

PHP may have the same problem, or a different one.

PHP's default integer type is a C long. On a 64-bit Unix platform, this is bigger than an unsigned int, so you will have to use the same trick as on Python (either % or &, whichever makes more sense to you.)

But on a 32-bit Unix platform, or on Windows, this is the same size as unsigned int but signed, which means you need a different trick. You can't actually represent, say, 4294967293 directly (try it, and you'll get -3 instead). You can use a GMP or BCMath integer instead of the default type (in which case it's basically the same as in Python), or you can just write custom code for printing, comparing, etc. that will treat that -3 as if it were 4294967293.

Note that I'm just assuming that int is 32 bits, and long is either 32 or 64, because that happens to be true on every popular platform today. But the C standard only requires that int be at least 16 bits long, and long be at least 32 bits and no shorter than int. If you need to deal with very old platforms where int might be 16 bits (or 18!), or future platforms where it might be 64 or more, you have to adjust your code appropriately.

Question 2

Mask the hash value with &:

def hash_string(str, _width=2**32-1):
    hash = 3151
    for c in str:
        hash = ((hash << 3) + (ord(c) << 2) * 3)
    return hash & _width

This manually cuts the hash back to size. You only need to limit the result once; it's not as if those higher bits make a difference for the final result.

Demo:

>>> hash_string("foo bar spam")
3355459964