Question

I am using the Rabin–Karp algorithm to check for plagiarism between any two source code files. First I simply implemented the algorithm in C#; the code is below. Its average and best-case running time is O(n+m) with O(p) space, but its worst-case time is O(nm).

    public void plagiarism(string[] file1, string[] file2)
    {
        int matches = 0;

        // Slide file2 across file1 one position at a time (naive O(n*m) comparison).
        for (int i = 0; i < (file1.Length - file2.Length + 1); i++)
        {
            bool windowMatches = true;

            // Compare the window of file1 starting at i against file2, line by line.
            for (int j = 0; j < file2.Length; j++)
            {
                if (file1[i + j] != file2[j])
                {
                    windowMatches = false;
                    break;
                }
            }

            if (windowMatches)
            {
                matches++;
                Console.WriteLine(matches);
            }
        }

        if (matches == 0)
        {
            Console.WriteLine("not copied");
        }
    }

So how would I make it more efficient by using a rolling hash function, since that should be better than this?


Solution

The Wikipedia article has a reasonably good discussion of the algorithm, and even mentions how you can implement the rolling hash function (see "Use of hashing for shifting substring search"). It also addresses how to improve runtime speed using a hash table or Bloom filter.

You also have to understand that the worst case is a fairly contrived example. The example given in the Wikipedia article is 'searching for a string of 10,000 "a"s followed by a "b" in a string of 10 million "a"s.'

You should be able to implement the rolling hash using the techniques described in that Wikipedia entry. If you're having trouble implementing that, leave a more specific question about how it's done, showing what you've tried.
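
For reference, here is a rough sketch of what that rolling-hash version can look like in C#, treating each file as an array of lines. The constants (base 257, modulus 1e9+7), the per-line hash, and the method names are my own assumptions for illustration, not something prescribed by the Wikipedia article:

    using System;

    public static class RollingHashSketch
    {
        const ulong B = 257;              // assumed base for the polynomial hash
        const ulong M = 1_000_000_007UL;  // assumed large prime modulus

        // Counts how many windows of file1 exactly match file2 (both are arrays of lines).
        public static int CountMatches(string[] file1, string[] file2)
        {
            int n = file1.Length, m = file2.Length;
            if (m == 0 || n < m) return 0;

            // B^(m-1) mod M, used to remove the leading line when the window slides.
            ulong high = 1;
            for (int i = 1; i < m; i++) high = high * B % M;

            ulong patternHash = 0, windowHash = 0;
            for (int i = 0; i < m; i++)
            {
                patternHash = (patternHash * B + LineHash(file2[i])) % M;
                windowHash = (windowHash * B + LineHash(file1[i])) % M;
            }

            int matches = 0;
            for (int i = 0; i <= n - m; i++)
            {
                // Only fall back to a line-by-line check when the hashes agree.
                if (windowHash == patternHash && SameWindow(file1, i, file2))
                    matches++;

                if (i < n - m)
                {
                    // Roll the hash: drop file1[i], append file1[i + m].
                    windowHash = (windowHash + M - LineHash(file1[i]) * high % M) % M;
                    windowHash = (windowHash * B + LineHash(file1[i + m])) % M;
                }
            }
            return matches;
        }

        // Simple deterministic polynomial hash of one line.
        static ulong LineHash(string line)
        {
            ulong h = 0;
            foreach (char c in line) h = (h * B + c) % M;
            return h;
        }

        static bool SameWindow(string[] text, int start, string[] pattern)
        {
            for (int j = 0; j < pattern.Length; j++)
                if (text[start + j] != pattern[j]) return false;
            return true;
        }
    }

The key part is the pair of lines that "roll" the hash: the window hash is updated in O(1) as the window slides, and the direct line-by-line comparison only runs when the window hash equals the pattern hash.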

It's unlikely that you'll encounter anything approaching the worst case in real-world documents. Even if you were to encounter the worst case, the rolling hash will not reduce the complexity. Implementing the rolling hash gives a linear improvement in runtime, which will be swamped by the n*m complexity. If you find that the worst case happens often, then you probably need a different algorithm.

The other thing to note is that, while O(m*n) can be a problem, you have to look at the scale. How large are the documents you're examining? You say you're working with source code files. If you're looking at typical class projects, then you're probably talking about maybe 2,000 lines of code. Those documents aren't going to exhibit the worst case. Even if they did, n*m isn't going to be a very large number.

However, if you have 100 documents and you want to know whether any one is a substantial duplicate of another, your larger problem is O(n^2), because you have to check every document against all the others. The number of document comparisons is (n*(n-1))/2; for 100 documents that's 4,950 comparisons. If you're looking to optimize your process, you need a different algorithm. Ideally, something that will give you a "fingerprint" of a document. That way, you can compute the fingerprint for each document one time, and then compare the fingerprints for similarity.

Document fingerprinting is a well-known problem. However, constructing a fingerprint that's useful for comparison purposes is a bit less straightforward. You'd want to look into a technique called shingling. I've also seen some research about using a small Bloom filter (256 bytes or so) to represent a document, and the ability to do fast comparisons using that.
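
To make the "fingerprint once, compare many times" idea concrete, here is a minimal shingling sketch in C#. The shingle size, the similarity threshold, and all of the names (ShingleFingerprint, JaccardSimilarity) are assumptions of mine for illustration, not part of any particular fingerprinting scheme from the research mentioned above:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class FingerprintSketch
    {
        // Hash every run of k consecutive lines into a set: that set is the fingerprint.
        // string.GetHashCode() is randomized per process in .NET Core, which is fine here
        // because all fingerprints are computed within the same run.
        public static HashSet<int> ShingleFingerprint(string[] lines, int k = 5)
        {
            var shingles = new HashSet<int>();
            for (int i = 0; i + k <= lines.Length; i++)
                shingles.Add(string.Join("\n", lines.Skip(i).Take(k)).GetHashCode());
            return shingles;
        }

        // Jaccard similarity: |A ∩ B| / |A ∪ B|.
        public static double JaccardSimilarity(HashSet<int> a, HashSet<int> b)
        {
            if (a.Count == 0 && b.Count == 0) return 0.0;
            int intersection = a.Count(x => b.Contains(x));
            return (double)intersection / (a.Count + b.Count - intersection);
        }

        public static void Main()
        {
            // documents[i] would hold the lines of the i-th source file.
            string[][] documents = { /* ... load files here ... */ };

            // Fingerprint each document exactly once.
            var fingerprints = documents.Select(d => ShingleFingerprint(d)).ToArray();

            // Then compare fingerprints pairwise: (n * (n - 1)) / 2 cheap set comparisons.
            for (int i = 0; i < fingerprints.Length; i++)
                for (int j = i + 1; j < fingerprints.Length; j++)
                    if (JaccardSimilarity(fingerprints[i], fingerprints[j]) > 0.5)
                        Console.WriteLine($"documents {i} and {j} look substantially similar");
        }
    }

Each document is hashed into a set of shingles only once; the (n*(n-1))/2 pairwise comparisons then operate on the small sets rather than on the full documents.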

All that said, I suspect that if you're talking about a hundred or two source code files that are each maybe 1,000 or 2,000 lines long, the naive O(n^2) comparison technique using a good Rabin-Karp implementation will do what you want. It will take some time (for 100 files you're going to do about 5,000 separate document comparisons), but I don't think the speed of the R-K implementation will be your limiting factor.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow