Question

I refer to the Rabin Karp Wikipedia article on Hash use.

In the example, the string "hi" is hashed using a prime number 101 as the base.

hash("hi")= ASCII("h")*101^1+ASCII("i")*101^0 = 10609 

Can such an algorithm be used practically in Java or C# where long has a maximum value of 9,223,372,036,854,775,807? Naively, to me it seems that the hash value grows exponentially and with a large enough N (being string length) will result in overflow of the long type. For example, say I have 65 characters in my string input for the hash?

Is this correct, or are there methods of implementation which will never need to overflow (I can imagine possibly some lazy evaluation which merely stores the ascii and unit place in the prime base)?


Solution

hash("hi")= ASCII("h")*101^1+ASCII("i")*101^0 = 10609

That's only half the truth. If you actually computed the value s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n exactly, the result would be a number whose representation is about as long as the string itself, so you would not have gained anything. What you actually compute is

(s_0 * p^0 + s_1 * p^1 + ... + s_n * p^n) mod M

where M is reasonably small. Thus your hash value will always be smaller than M.

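So what you do in practice is choose M = 2^64 and make use of the fact that fixed-width integer overflow wraps around in many programming languages. Multiplication and addition of 64-bit integers in Java and C# (in the default unchecked context), and of unsigned 64-bit integers in C++, are equivalent to multiplication and addition modulo 2^64. Note that signed overflow in C++ is undefined behaviour, so an unsigned type is required there.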
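To make the wrap-around concrete, here is a minimal Java sketch. The class name WrappingHash and the helper exactHash are made up for illustration. It uses the question's ordering (the first character gets the highest power, evaluated with Horner's rule), which is the formula above with the index order reversed. It compares the wrapped long against the exact BigInteger value to show that overflow simply discards everything above the low 64 bits.

```java
import java.math.BigInteger;

final class WrappingHash {
    static final long BASE = 101L;

    // Horner's rule. Java's long arithmetic wraps silently on overflow,
    // so every step is implicitly reduced modulo 2^64.
    static long hash(String s) {
        long h = 0L;
        for (int i = 0; i < s.length(); i++) {
            h = h * BASE + s.charAt(i);
        }
        return h;
    }

    // The same polynomial computed exactly, without any reduction.
    static BigInteger exactHash(String s) {
        BigInteger h = BigInteger.ZERO;
        BigInteger base = BigInteger.valueOf(BASE);
        for (int i = 0; i < s.length(); i++) {
            h = h.multiply(base).add(BigInteger.valueOf(s.charAt(i)));
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("hi")); // 10609, small enough not to wrap

        // 65 characters, as in the question.
        String s = new String(new char[65]).replace('\0', 'a');
        BigInteger exact = exactHash(s);
        System.out.println(exact.bitLength() + " bits needed for the exact value");
        // longValue() keeps only the low-order 64 bits, i.e. the value mod 2^64.
        System.out.println(exact.longValue() == hash(s)); // true
    }
}
```

So a 65-character input does overflow, but the overflow is exactly the modular reduction the algorithm wants. The result may print as a negative number because long is signed; that does not matter for equality comparisons.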

It's not necessarily a wise choice to use 2^64 as the modulus. In fact you can easily construct a string with lots of collisions, thus provoking the worst case behaviour of Rabin-Karp, which is Ω(n * m) matching instead of O(n + m).
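One well-known way to build such collisions uses the Thue-Morse sequence. The answer does not name a construction, so treat the following as an illustration rather than what the author necessarily had in mind. The sketch builds a Thue-Morse string of length 2^10 over the letters 'a' and 'b' and its letter-swapped twin. The difference of their hashes factors into a product of terms p^(2^j) - 1, each of which is even for an odd base p, and by length 1024 the product is divisible by 2^64. The class and method names are hypothetical.

```java
final class ThueMorseCollision {
    static final long BASE = 101L;

    // Polynomial hash with implicit reduction modulo 2^64 (long wrap-around).
    static long hash(String s) {
        long h = 0L;
        for (int i = 0; i < s.length(); i++) {
            h = h * BASE + s.charAt(i);
        }
        return h;
    }

    // Thue-Morse string of length 2^k: position i holds 'b' if i has an odd
    // number of 1 bits, otherwise 'a'. With flip = true the letters are swapped.
    static String thueMorse(int k, boolean flip) {
        int n = 1 << k;
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            boolean oddBits = (Integer.bitCount(i) & 1) == 1;
            sb.append(oddBits != flip ? 'b' : 'a');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = thueMorse(10, false);
        String b = thueMorse(10, true);
        System.out.println(a.equals(b));          // false: they differ everywhere
        System.out.println(hash(a) == hash(b));   // true: same hash modulo 2^64
    }
}
```

The two strings differ at every position yet hash to the same value, and this holds for any odd base, not just 101.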

It would be better to use a large prime as the modulus and get much better collision resistance. The reason why this is usually not done is performance: we would need to apply modular reduction explicitly (adding a % M) to every addition and multiplication. What's worse, we can't even use the built-in multiplication anymore, because the product could overflow if M > 2^32. So we need a custom MultiplyMod function, which is bound to be a lot slower than machine-level multiplication.
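A rough Java sketch of what that looks like, under some assumptions of mine: the modulus 2^61 - 1 (a Mersenne prime) is just one possible choice, and mulMod here is a simple double-and-add loop, picked because it is easy to check for overflow, not because it is fast. The names PrimeModHash and mulMod are hypothetical.

```java
final class PrimeModHash {
    static final long BASE = 101L;
    // 2^61 - 1 is prime. Any two values below it sum to less than 2^62,
    // so the additions in mulMod can never overflow a signed long.
    static final long M = (1L << 61) - 1;

    // Computes (a * b) mod M for non-negative a and b without ever forming
    // the full product: double-and-add over the bits of b.
    static long mulMod(long a, long b) {
        long result = 0L;
        a %= M;
        while (b > 0) {
            if ((b & 1L) == 1L) {
                result = (result + a) % M;
            }
            a = (a + a) % M;
            b >>= 1;
        }
        return result;
    }

    // Horner's rule with an explicit reduction after every step.
    static long hash(String s) {
        long h = 0L;
        for (int i = 0; i < s.length(); i++) {
            h = (mulMod(h, BASE) + s.charAt(i)) % M;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("hi")); // 10609, below M so unchanged
    }
}
```

Every multiplication now costs a loop instead of a single machine instruction, which is the performance penalty described above.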

Is this correct, or are there methods of implementation which will never need to overflow (I can imagine possibly some lazy evaluation which merely stores the ascii and unit place in the prime base)?

As I already mentioned, if you don't reduce using a modulus, your hash value will grow as large as the string itself, thus rendering it useless to use a hash function in the first place. So yes, using controlled overflow modulo 2^64 is correct and even necessary if we don't manually reduce.

OTHER TIPS

If your goal is a form of storage which contains only "small" numbers,
but where the totals can still be compared:

You could view this simply as a base-101 number system,
like base 10 (decimal), base 16 (hex), and so on.
I.e.

a) You have to store a set of { value, its power of 101 } pairs
(with at most one entry per power).

b) When creating the data from a string,
values >= 101 have to be carried over to the next power.

Example 1:
"a" is 97*101^0
(trivial)

Example 2:
"g" is 1*101^1 + 2*101^0
because g is 103. 103 >= 101, i.e. keep only 103 % 101 for 101^0
(modulo, the remainder of the division)
and carry (int)(103/101) to the next power.

(If the ASCII values could be higher, or the prime were lower than 101,
the carry (int)(value/prime) could itself reach or exceed the prime.
In that case it continues on to prime^2 and so on, until the remaining value is smaller
than the prime.)

Example 3:
"ag" is 98*101^1 + 2*101^0
Compared to Example 2, 97*101^1 is added because of the "a", and so on.

To compare without calculating the full sum,
just compare the stored values power by power.
Two entries are equal if the values match at every power.
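The carrying scheme above could be sketched in Java roughly like this. The class name Base101Digits and the array layout (least significant power first) are my own choices for illustration. Note that the result has about as many entries as the string has characters, so this is an exact representation, not a fixed-size hash.

```java
import java.util.Arrays;

final class Base101Digits {
    static final int PRIME = 101;

    // Returns the base-101 digits of the string's value, least significant
    // power first. The first character gets the highest power, as in the
    // examples above. Three spare slots leave room for carries out of the top.
    static int[] digits(String s) {
        int n = s.length();
        int[] d = new int[n + 3];
        for (int i = 0; i < n; i++) {
            d[n - 1 - i] = s.charAt(i);
        }
        int carry = 0;
        for (int p = 0; p < d.length; p++) {
            int v = d[p] + carry;
            d[p] = v % PRIME;   // what stays at this power
            carry = v / PRIME;  // what moves on to the next power
        }
        return d;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(digits("ag"))); // [2, 98, 0, 0, 0]
    }
}
```

Two strings of the same length can then be compared with Arrays.equals on their digit arrays, which checks the values power by power.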

Side note: Be aware that ^ is not exponentiation in languages like C# and Java.
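In Java, for instance, ^ is bitwise XOR. A tiny sketch of the difference, with a hypothetical integer power helper written as a plain loop (the class and method names are made up):

```java
final class CaretDemo {
    // What ^ actually does in Java: bitwise exclusive OR.
    static long xor(long a, long b) {
        return a ^ b;
    }

    // Integer exponentiation by repeated multiplication (exp >= 0).
    // Like all long arithmetic in Java, it wraps around on overflow.
    static long pow(long base, int exp) {
        long result = 1L;
        for (int i = 0; i < exp; i++) {
            result *= base;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(xor(101, 1)); // 100, not 101
        System.out.println(pow(101, 1)); // 101
        System.out.println(104 * pow(101, 1) + 105 * pow(101, 0)); // 10609
    }
}
```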

Licensed under: CC-BY-SA with attribution