hash("hi")= ASCII("h")*101^1+ASCII("i")*101^0 = 10609
That's only half the truth. If you actually computed the value s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n, the result would be a number whose representation is about as long as the string itself, so you would have gained nothing. What you actually compute is
(s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n) mod M
where M is reasonably small. Thus your hash value will always be smaller than M.
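As a minimal sketch of this reduction, here is the hash computed with Horner's rule, reducing mod M at every step so intermediate values never grow beyond M. The base p = 101 matches the hash("hi") example above; M = 1'000'000'007 is just an illustrative choice of modulus, not one prescribed by the answer.

```cpp
#include <cstdint>
#include <string>

// Polynomial hash with an explicit modulus M. Horner's rule produces the
// same sum of s_i * p^k terms as the formula above without precomputing
// powers of p; the % M after every step keeps intermediates below M.
std::uint64_t poly_hash(const std::string& s, std::uint64_t p, std::uint64_t M) {
    std::uint64_t h = 0;
    for (unsigned char c : s)
        h = (h * p + c) % M;  // multiply by base, add next character, reduce
    return h;
}
```

For "hi" this gives 104 * 101 + 105 = 10609, matching the example at the top (no reduction occurs since 10609 < M).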
So what you do in practice is choose M = 2^64 and exploit the fact that unsigned integer arithmetic wraps around on overflow with well-defined behaviour in C++ and C#, and that Java's 64-bit long arithmetic likewise wraps. Multiplication and addition of 64-bit integers in Java, C++ and C# are therefore equivalent to multiplication and addition modulo 2^64.
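The "M = 2^64" variant can be sketched as follows: unsigned 64-bit arithmetic wraps around, so every operation is implicitly taken mod 2^64 and no explicit % appears anywhere. Again, p = 101 is only an illustrative base.

```cpp
#include <cstdint>
#include <string>

// Polynomial hash with the modulus chosen as 2^64: unsigned 64-bit
// overflow wraps, so the reduction mod 2^64 happens for free on every
// multiplication and addition.
std::uint64_t poly_hash_u64(const std::string& s, std::uint64_t p) {
    std::uint64_t h = 0;
    for (unsigned char c : s)
        h = h * p + c;  // wraps mod 2^64 by the rules of unsigned arithmetic
    return h;
}
```

This is exactly the earlier function with the % M removed, which is why this variant is as fast as hashing can get.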
However, 2^64 is not necessarily a wise choice of modulus. In fact you can easily construct strings with lots of collisions, provoking the worst-case behaviour of Rabin-Karp: Ω(n * m) matching instead of the expected O(n + m).
It would be better to use a large prime as the modulus and get much better collision resistance. The reason this is usually not done is performance: we would need an explicit modular reduction (a % M) after every addition and multiplication. What's worse, we can't even use the built-in multiplication anymore, because the product of two values below M could overflow 64 bits if M > 2^32. So we need a custom MultiplyMod function, which is bound to be a lot slower than a single machine-level multiplication.
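One way to sketch such a MultiplyMod is to widen the product to 128 bits before reducing. unsigned __int128 is a GCC/Clang extension, not standard C++; portable code would need a manual high/low split or compiler intrinsics instead, which is part of why this route costs noticeably more than a plain 64-bit multiply.

```cpp
#include <cstdint>

// Sketch of a custom MultiplyMod: a * b can exceed 64 bits when M > 2^32,
// so compute the full 128-bit product first, then reduce mod M.
// Requires GCC/Clang's unsigned __int128 extension.
std::uint64_t multiply_mod(std::uint64_t a, std::uint64_t b, std::uint64_t M) {
    return static_cast<std::uint64_t>(
        static_cast<unsigned __int128>(a) * b % M);
}
```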
Is this correct, or are there methods of implementation which will never need to overflow (I can imagine possibly some lazy evaluation which merely stores the ASCII values and each one's place in the base-p representation)?
As already mentioned, if you don't reduce using a modulus, your hash value grows as large as the string itself, which defeats the point of using a hash function in the first place. So yes, relying on controlled overflow modulo 2^64 is correct, and even necessary if we don't reduce manually.