Question

Most applications, especially databases, can sort and filter by small integers or floats much faster than they can do string comparisons.

Therefore I'm wondering if there is a hashing function that I can use to turn a short string (about 5 - 40 characters) into a 32-bit or 64-bit number, so that I can compare by integer instead of by string.

I first thought of crc32, but it seems it's much too small of a number and would result in possible collisions in less than 50,000 hashes (I need to do over a million).

I'm mostly interested in working in Python, PHP, V8 Javascript, PostgreSQL, and MySQL.


Solution

The problem that collisions become likely at around 50k entries is inherent in all 32-bit hashes. If you read up on the birthday problem, you'll see that collisions become likely once you have around sqrt(HashSpace) elements, e.g. sqrt(2^32) = 2^16, or about 64k, for 32-bit hashes.


With 64 bit hashes collisions become much rarer. But I still don't feel too comfortable betting the correctness of my program on that.

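Using an approximation from Wikipedia:

p(n) ≈ n^2 / (2m), where n is the number of elements and m = 2^64 is the size of the hash space.

We obtain a probability of about 3*10^-8 for 1 million elements, and 3*10^-6 for 10 million elements.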
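To check these figures yourself, here is a minimal Python sketch. It assumes the approximation meant here is the usual birthday bound p ≈ 1 - exp(-n^2 / (2m)), which reduces to n^2 / (2m) when p is small; the function name `collision_probability` is my own.

```python
import math

def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n uniformly random hashes."""
    m = 2 ** bits  # size of the hash space
    # -expm1(-x) equals 1 - exp(-x), but stays accurate when x is tiny
    return -math.expm1(-n * n / (2 * m))

# Compare 32-bit and 64-bit hashes at a few collection sizes
for n in (50_000, 1_000_000, 10_000_000):
    p32 = collision_probability(n, 32)
    p64 = collision_probability(n, 64)
    print(f"n={n:>10,}  32-bit: {p32:.3g}  64-bit: {p64:.3g}")
```

For 32 bits the probability is already around 25% at 50,000 elements, while for 64 bits it stays in the 10^-8 to 10^-6 range quoted above.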

You could use CRC64 for that. Or just truncate a cryptographic hash, such as MD5 or SHA-1, to the desired length.
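Truncating a cryptographic hash might look like this in Python. This is only a sketch: taking the first 8 bytes, the big-endian byte order, and the names `hash64` and `hash64_signed` are my own choices, not anything standard.

```python
import hashlib

def hash64(text: str) -> int:
    """Unsigned 64-bit hash: the first 8 bytes of the SHA-1 digest."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def hash64_signed(text: str) -> int:
    """Same 8 bytes read as a signed integer, to fit a signed BIGINT column."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

print(hash64("some short string"))
print(hash64_signed("some short string"))
```

The signed variant is there because PostgreSQL's BIGINT is signed, while MySQL also offers BIGINT UNSIGNED. Whichever you pick, use the same bytes and byte order in every language that has to produce matching values.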


If a malicious person can choose the strings, breaking your program by deliberately creating collisions, you should at least switch to a keyed hash, such as HMAC.
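A keyed variant could be sketched with Python's standard `hmac` module. I'm assuming HMAC-SHA256 truncated to 64 bits here; `SECRET_KEY` and `keyed_hash64` are hypothetical names, and in a real system the key would be loaded from configuration rather than generated at startup, otherwise the hashes change on every run.

```python
import hashlib
import hmac
import secrets

# Hypothetical key for illustration; persist a real key so hashes stay stable
SECRET_KEY = secrets.token_bytes(32)

def keyed_hash64(text: str, key: bytes = SECRET_KEY) -> int:
    """Unsigned 64-bit hash: the first 8 bytes of HMAC-SHA256 under a secret key."""
    mac = hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(mac[:8], "big")

print(keyed_hash64("some short string"))
```

Without the key an attacker cannot compute hashes offline to search for colliding strings. Accidental collisions are still as likely as with any other 64-bit hash.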


Depending on what you're doing, you could also simply create an in-memory mapping between string and int, where you increment a counter for each new element you encounter. This gives you a perfect mapping without any risk of collisions, but is only applicable in some scenarios.
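A minimal sketch of such a mapping in Python, assuming everything fits in memory and lives in a single process (the class name `StringInterner` and its methods are made up for this example):

```python
class StringInterner:
    """Assigns each distinct string the next integer from a counter."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}   # string -> id
        self._strings: list[str] = []    # id -> string

    def id_for(self, text: str) -> int:
        """Return the existing id for text, or assign the next free one."""
        existing = self._ids.get(text)
        if existing is not None:
            return existing
        new_id = len(self._strings)      # the counter is just the list length
        self._ids[text] = new_id
        self._strings.append(text)
        return new_id

    def string_for(self, ident: int) -> str:
        """Reverse lookup from id back to the original string."""
        return self._strings[ident]

interner = StringInterner()
print(interner.id_for("apple"), interner.id_for("banana"), interner.id_for("apple"))
```

The ids depend on the order in which strings are first seen, so they are only meaningful within one process or one shared table. That is the main reason this only fits some scenarios.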

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow