Question

I found a rabin karp code from that site and changed to try. Changed code is below. You can see words and their hash values in hashtable.txt. for below example hashtable.txt seems right.

But when I changed M (block length) to 150 I am getting wrong results. For example in hashtable.txt first line and 6th line must be same but their hash values are different.

Or when I changed q (prime number) to 683303 it's getting wrong results too.

What's the relation between prime number and block length in rabin karp algorithm, and what's reason of wrong results?

#include<stdio.h>
#include<string.h>
#include <fstream>
#include <iostream>
// d is the number of characters in input alphabet
#define d 256 
int M = 80;
/*  
txt  -> text
q    -> A prime number
*/
using namespace std;

void writeTable(char *txt, int q)
{
ofstream myfile;
myfile.open ("hashtable.txt");
int N = strlen(txt);
int i, j;
int t = 0; // hash value for txt
int h = 1;

// The value of h would be "pow(d, M-1)%q"
for (i = 0; i < M-1; i++)
    h = (h*d)%q;

// Calculate the hash value of pattern and first window of text
for (i = 0; i < M; i++)
{
    t = (d*t + txt[i])%q;
}

// Slide the pattern over text one by one 
for (i = 0; i <= N - M; i++)
{
    myfile << t <<" ";
    for (long z = i; z < M+i; z++){myfile<<txt[z];}myfile<<"\n";

    // Calulate hash value for next window of text: Remove leading digit, 
    // add trailing digit           
    if ( i < N-M )
    {
        t = (d*(t - txt[i]*h) + txt[i+M])%q;

        // We might get negative value of t, converting it to positive
        if(t < 0) 
          t = (t + q); 
    }
}

myfile.close();
}

/* Driver program to test above function */
int main()
{
    char *txt ="abcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcdeabcde";

int q = 683303;  // A prime number

writeTable(txt, q);

printf("finish");
getchar();
return 0;
}
Was it helpful?

Solution

The computation

t = (d*(t - txt[i]*h) + txt[i+M])%q;

can overflow. The maximal value of txt[i] is d-1, and h can be as large as q-1. So if (q-1)*(d-1)*d > INT_MAX, there is the possibility of integer overflow. That limits the size of the prime that can be safely chosen to INT_MAX/(d*(d-1)) + 1.

If q is larger than that, that poses restrictions on the admissible values for M, namely M must be such that

h <= INT_MAX/(d*(d-1))

to safely prevent overflow.

With q = 683303 and M = 80, you get h = 182084, and

h*d*(d-1) = 182084 * 256 * 255 = 11886443520

is larger than INT_MAX if int is 32 bits wide as it usually is.

If your ints are 32 bits wide, you have overflow for the example from the beginning, because h*256*97 = 4521509888 > 2147483647.

OTHER TIPS

The "block length" is the length of the pattern. Since you don't have any pattern in your code, the number 150 is meaningless, unless that's the actual length of the pattern that you intend to use.

The values of hashes must depend on the data being hashed and on the amount of it. So, hashes of "abcde", "abcd", "abc" are likely to be all different.

In this algorithm you avoid unnecessary comparing of the pattern to a same-length portion of the text by first comparing the hashes of both.

If the hashes are different, you know the two sequences of characters are different and there's no match and so you can move on to the next position in the text and repeat the procedure.

If the hashes match, you have a potential match of the two character sequences and then you compare them to see if there's a real match.

This is the main idea of the algorithm and this is what makes it faster than naïve implementations of substring search.

The purpose of dividing by a prime number when calculating the hashes is to try to get a more uniform distribution of hash values. If you choose a very big prime number, it's not going to have much if any effect. If you choose a very small prime number, you reduce the total number of hash values and are increasing the odds of hash matches and therefore the odds of doing unnecessary substring comparison.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top