Prime Number and Block Length in Karp Rabin

Question 1

The computation

t = (d*(t - txt[i]*h) + txt[i+M])%q;

can overflow. The maximal value of txt[i] is d-1, and h can be as large as q-1, so the product d*txt[i]*h can reach (q-1)*(d-1)*d. If (q-1)*(d-1)*d > INT_MAX, signed integer overflow is possible, which is undefined behavior in C. That limits the prime that can be safely chosen to at most INT_MAX/(d*(d-1)) + 1.

If q is larger than that, the admissible values of M are restricted: M must be such that

h <= INT_MAX/(d*(d-1))

to safely prevent overflow.
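
The bound above can be checked before any hashing starts. The following is a minimal sketch, assuming h is defined as d^(M-1) mod q (the usual Rabin-Karp choice) and that d >= 2 and q >= 1. The function names rk_high_order_weight and rk_int_update_is_safe are hypothetical, chosen for illustration. The check covers only the bound stated above and nothing else.

```c
#include <limits.h>

/* h = d^(M-1) mod q. The running product is kept in a long long so that
   computing h cannot itself overflow when d and q fit in a 32-bit int. */
int rk_high_order_weight(int d, int M, int q)
{
    long long h = 1 % q;
    for (int k = 0; k < M - 1; k++)
        h = (h * d) % q;
    return (int)h;
}

/* Returns 1 if h <= INT_MAX/(d*(d-1)), the condition stated above,
   and 0 otherwise. Assumes d >= 2 and that d*(d-1) fits in an int. */
int rk_int_update_is_safe(int d, int h)
{
    return h <= INT_MAX / (d * (d - 1));
}
```

With a 32-bit int and d = 256, INT_MAX/(d*(d-1)) is 32896, so rk_int_update_is_safe(256, 182084) returns 0.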

With q = 683303 and M = 80, you get h = 182084, and

h*d*(d-1) = 182084 * 256 * 255 = 11886443520

which is larger than INT_MAX = 2147483647 if int is 32 bits wide, as it usually is.

So with 32-bit ints, the example from the beginning already overflows, because h*256*97 = 4521509888 > 2147483647.
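
One common way around the overflow, if you want to keep q and M as they are, is to do the update in a 64-bit type. This is a sketch, not the code from the question: rk_roll is a hypothetical helper, out stands for txt[i], in stands for txt[i+M], and it assumes d and q fit comfortably in 32 bits (for example d = 256).

```c
#include <stdint.h>

/* One rolling-hash step: drop the outgoing byte `out` (txt[i]) and
   append the incoming byte `in` (txt[i+M]). All intermediate values
   are held in int64_t, so d*out*h cannot overflow for 32-bit d, h. */
int rk_roll(int t, unsigned char out, unsigned char in, int d, int h, int q)
{
    int64_t v = ((int64_t)d * ((int64_t)t - (int64_t)out * h) + in) % q;
    if (v < 0)
        v += q;   /* in C, % takes the sign of the dividend */
    return (int)v;
}
```

The final adjustment matters because t - txt[i]*h is usually negative, and a negative hash would never compare equal to the pattern hash, which is computed from non-negative terms only.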

Question 2

The "block length" is the length of the pattern. Since you don't have any pattern in your code, the number 150 is meaningless, unless that's the actual length of the pattern that you intend to use.

The values of hashes must depend on the data being hashed and on the amount of it. So, hashes of "abcde", "abcd", "abc" are likely to be all different.

In this algorithm you avoid unnecessary comparisons of the pattern with a same-length portion of the text by first comparing their hashes.

If the hashes are different, you know the two character sequences are different, so there is no match at this position. You move on to the next position in the text and repeat the procedure.

If the hashes match, you have a potential match of the two character sequences and then you compare them to see if there's a real match.

This is the main idea of the algorithm, and it is what makes it faster in the typical case than naïve implementations of substring search.
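
The hash-first, compare-second procedure described above can be sketched as follows. This is an illustrative version, not the code from the question: rk_search is a hypothetical name, the hashes are kept in int64_t to sidestep the overflow discussed under Question 1, and d and q are assumed to fit in 32 bits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the index of the first occurrence of pat in txt, or -1. */
long rk_search(const char *txt, const char *pat, int d, int q)
{
    size_t n = strlen(txt), m = strlen(pat);
    if (m == 0) return 0;
    if (m > n) return -1;

    int64_t h = 1 % q, p = 0, t = 0;
    for (size_t k = 0; k + 1 < m; k++)      /* h = d^(m-1) mod q */
        h = (h * d) % q;
    for (size_t k = 0; k < m; k++) {        /* hash pattern and first window */
        p = (p * d + (unsigned char)pat[k]) % q;
        t = (t * d + (unsigned char)txt[k]) % q;
    }

    for (size_t i = 0; ; i++) {
        /* Compare characters only when the hashes agree. */
        if (p == t && memcmp(txt + i, pat, m) == 0)
            return (long)i;
        if (i + m >= n)                     /* that was the last window */
            break;
        /* Slide the window: drop txt[i], append txt[i+m]. */
        t = (d * (t - (unsigned char)txt[i] * h) + (unsigned char)txt[i + m]) % q;
        if (t < 0)
            t += q;
    }
    return -1;
}
```

For example, rk_search("hello world", "world", 256, 683303) returns 6. The memcmp call is what turns a potential match into a confirmed one. Without it, a hash collision would be reported as a match.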

The purpose of reducing modulo a prime number when calculating the hashes is to try to get a more uniform distribution of hash values. If you choose a very big prime number, the reduction has little, if any, effect. If you choose a very small prime number, you reduce the total number of possible hash values, which increases the odds of hash matches and therefore of unnecessary substring comparisons.
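
To see the effect of the prime's size, you can count the positions where the hashes match but the strings do not. The sketch below uses hypothetical helpers rk_hash and rk_count_spurious, and recomputes every window hash from scratch for clarity rather than rolling it, so it is a measuring tool and not an efficient search.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Polynomial hash of the first m bytes of s, base d, modulo q. */
static int64_t rk_hash(const char *s, size_t m, int d, int q)
{
    int64_t v = 0;
    for (size_t k = 0; k < m; k++)
        v = (v * d + (unsigned char)s[k]) % q;
    return v;
}

/* Number of windows whose hash equals the pattern's hash although
   the window's characters differ from the pattern. */
size_t rk_count_spurious(const char *txt, const char *pat, int d, int q)
{
    size_t n = strlen(txt), m = strlen(pat), count = 0;
    if (m == 0 || m > n)
        return 0;
    int64_t p = rk_hash(pat, m, d, q);
    for (size_t i = 0; i + m <= n; i++)
        if (rk_hash(txt + i, m, d, q) == p && memcmp(txt + i, pat, m) != 0)
            count++;
    return count;
}
```

With d = 256, text "abcdabcd" and pattern "ab", the tiny prime q = 2 gives 2 spurious hash matches (both "cd" windows), while q = 683303 gives 0. In the second case every two-byte value is already smaller than q, so the modulo does not change it.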