Question

Note: Lots of Possible duplicates, but nothing seems to be solving my problem.

I am working on a Plagiarism detection based on MOSS.

After successfully implementing a filter which strips out all the necessary details(comments,punctuations etc) I hash the content using a Rolling Hash Implementation(Rabin Karp)

However the Hashes which match in two text-files of source code, have very different underlying text(No plagiarism and yet same hashes)

The Algorithm I implemented(Ruby) --> (Partial Snippet)

 #Preprocessing from RobinKarp Algorithm
  for c in 0...k do
    text_hash=(radix*text_hash+text_to_process[c].ord)%q
  end

  #Main loop
  for c in 0...loop do   
        text_hash=((radix*text_hash-(text_to_process[c].ord)*highorder)+(text_hash[c+k].ord))%q    

Is there an issue with my implementation? Or the Parameters I specify can be at fault?

I take radix=34 ( I am not sure if it is the right value, I am assuming the stripped out text will only contain alphabets+some special charcters like '+','-','*','/' so a rough estimate of total 34 characters)

I am taking q(prime) to be 101

Is this a collision issue I am dealing with? Any pointers as to how to tackle the problem?

Was it helpful?

Solution

I note that with q = 101, there are only 101 possible hash values - 0, 1, 2...100. Have you tried increasing q? Another approach would be to look and see if the hash values look like they are randomly chosen values within the possible values of 0,1..q-1.

You should of course also test your program on cases where there are repeated strings for it to find - a failure there could be another symptom of any problem that is also causing collisions, and it would be easier to find and debug.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top