Rabin Karp string matching algorithm

https://stackoverflow.com/questions/10338379

03-06-2021
|

Question

I've seen this Rabin Karp string matching algorithm in the forums on the website and I'm interested in trying to implement it but I was wondering If anyone could tell me why the variables ulong Q and ulong D are 100007 and 256 respectively :S? What significance do these values carry with them?

static void Main(string[] args)
{
    string A = "String that contains a pattern.";
    string B = "pattern";
    ulong siga = 0;
    ulong sigb = 0;
    ulong Q = 100007;
    ulong D = 256;
    for (int i = 0; i < B.Length; i++)
    {
        siga = (siga * D + (ulong)A[i]) % Q;
        sigb = (sigb * D + (ulong)B[i]) % Q;
    }
    if (siga == sigb)
    {
        Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
        return;
    }
    ulong pow = 1;
    for (int k = 1; k <= B.Length - 1; k++)
        pow = (pow * D) % Q;

    for (int j = 1; j <= A.Length - B.Length; j++)
    {
        siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
        siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
        if (siga == sigb)
        {
            if (A.Substring(j, B.Length) == B)
            {
                Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
                                                                    A.Substring(j, B.Length),
                                                                    A.Substring(j + B.Length)));
                return;
            }
        }
    }
    Console.WriteLine("Not copied!");
}

No correct solution

OTHER TIPS

About the magic numbers Paul's answer is pretty clear.

As far as the code is concerned, Rabin Karp's principal idea is to perform an hash comparison between a sliding portion of the string and the pattern.

The hash cannot be computed each time on the whole substrings, otherwise the computation complexity would be quadratic O(n^2) instead of linear O(n).

Therefore, a rolling hash function is applied, such as at each iteration only one character is needed to update the hash value of the substring.

So, let's comment your code:

for (int i = 0; i < B.Length; i++)
{
    siga = (siga * D + (ulong)A[i]) % Q;
    sigb = (sigb * D + (ulong)B[i]) % Q;
}
if (siga == sigb)
{
    Console.WriteLine(string.Format(">>{0}<<{1}", A.Substring(0, B.Length), A.Substring(B.Length)));
    return;
}

^ This piece computes the hash of pattern B (sigb), and the hashcode of the initial substring of A of the same length of B. Actually it's not completely correct because hash can collide¹ and so, it is necessary to modify the if statement : if (siga == sigb && A.Substring(0, B.Length) == B).

ulong pow = 1;
for (int k = 1; k <= B.Length - 1; k++)
    pow = (pow * D) % Q;

^ Here's computed pow that is necessary to perform the rolling hash.

for (int j = 1; j <= A.Length - B.Length; j++)
{
    siga = (siga + Q - pow * (ulong)A[j - 1] % Q) % Q;
    siga = (siga * D + (ulong)A[j + B.Length - 1]) % Q;
    if (siga == sigb)
    {
        if (A.Substring(j, B.Length) == B)
        {
            Console.WriteLine(string.Format("{0}>>{1}<<{2}", A.Substring(0, j),
                                                                A.Substring(j, B.Length),
                                                                A.Substring(j + B.Length)));
            return;
        }
    }
}

^ Finally, the remaining string (i.e. from the second character to end), is scanned updating the hash value of the A substring and compared with the hash of B (computed at the beginning).

If the two hashes are equal, the substring and the pattern are compared¹ and if they're actually equal a message is returned.

¹ Hash values can collide; hence, if two strings have different hash values they're definitely different, but if the two hashes are equal they can be equal or not.

The algorithm uses hashing for fast string comparison. Q and D are magic numbers that the coder probably arrived at with a little bit of trial and error and give a good distribution of hash values for this particular algorithm.

You can see these types of magic numbers used for hashing many places. The example below is the decompiled definition of the GetHashCode function of a .NET 2.0 string type:

[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
public override unsafe int GetHashCode()
{
    char* chrPointer = null;
    int num1;
    int num2;
    fixed (string str = (string)this)
    {
        num1 = 352654597;
        num2 = num1;
        int* numPointer = chrPointer;
        for (int i = this.Length; i > 0; i = i - 4)
        {
            num1 = (num1 << 5) + num1 + (num1 >> 27) ^ numPointer;
            if (i <= 2)
            {
                break;
            }
            num2 = (num2 << 5) + num2 + (num2 >> 27) ^ numPointer + (void*)4;
            numPointer = numPointer + (void*)8;
        }
    }
    return num1 + num2 * 1566083941;
}

Here is another example from a R# generated GetHashcode override function for a sample type:

    public override int GetHashCode()
    {
        unchecked
        {
            int result = (SomeStrId != null ? SomeStrId.GetHashCode() : 0);
            result = (result*397) ^ (Desc != null ? Desc.GetHashCode() : 0);
            result = (result*397) ^ (AnotherId != null ? AnotherId.GetHashCode() : 0);
            return result;
        }
    }

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow