Question

I am building a web tool to check whether submitted content is taken from the web or is the submitter's own work: a plagiarism detector.

My idea is to generate a checksum and use it as a key to compare against other entries. However, if someone makes small changes, like adding/removing comments or renaming variables and functions, the checksum will be different, so this approach won't work.

Any suggestions for a better way?

Solution

Plagiarism detection is a special case of similarity detection. This is a big field of study that's almost as old as computer science itself. There is a lot of published research, and there just isn't a single simple answer.

See, e.g., a Google Scholar search for "code similarity plagiarism" or "plagiarism detection". Regular Google searches for things like "source code similarity detection algorithm" can also be useful.

There are plenty of existing tools in the space, too, so I'm surprised you're trying to write your own.

As you've noted, a checksum won't do the job unless the code is perfectly identical. Techniques that can help include:

  • Building word-frequency histograms and comparing them

  • Extracting comment text and looking for copied comments using text-substring matching

  • Extracting variable, class and method names and looking for other code that uses the same names. You have to do a lot of correction for "obvious" names that everyone will choose, and for names that are dictated by the problem, like implementing a particular interface or API. Private class member variables and the local variables inside a function or method are the most useful to compare. You will need the help of a compiler or at least a syntax parser for the language to extract these.

  • Looking for differences in indenting style. Did the user use all-spaces indenting, except for this one function that's indented with tabs?

  • Comparing parse trees or token streams to strip out the effects of formatting. You'd usually have to compare individual functions, etc, not just the code as a whole.

  • ... and lots more
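To make the first bullet concrete, here is a minimal sketch in Python of building word-frequency histograms and comparing them with cosine similarity. The tokenizing regex, the sample snippets, and the function names are all illustrative assumptions, not part of the answer above.

```python
import math
import re
from collections import Counter

def word_histogram(source: str) -> Counter:
    """Count identifier-like tokens, ignoring punctuation and whitespace."""
    return Counter(re.findall(r"[A-Za-z_]\w*", source.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two frequency vectors (1.0 = identical mix)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

original = "def total(items): return sum(i.price for i in items)"
suspect  = "def total(goods): return sum(g.price for g in goods)"  # variables renamed
score = cosine_similarity(word_histogram(original), word_histogram(suspect))
```

Identical files score 1.0, and a renamed copy still shares its keyword mix, though frequently repeated renamed identifiers pull the score down — which is exactly why the answer recommends combining several signals rather than trusting any single one.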
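The last bullet — comparing token streams to strip out the effects of formatting — can be sketched with Python's own `tokenize` module standing in for "a syntax parser for the language". The idea here is an illustrative assumption: drop comments and whitespace tokens, and rename every non-keyword identifier to a canonical placeholder, so neither reformatting nor renaming hides a copy.

```python
import io
import keyword
import tokenize

def normalized_tokens(source: str) -> list[str]:
    """Token stream with comments/formatting dropped and identifiers canonicalized."""
    names: dict[str, str] = {}
    out: list[str] = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Skip everything that is pure formatting.
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
                        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        # Replace each distinct identifier with ID0, ID1, ... in order of first use.
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append(names.setdefault(tok.string, f"ID{len(names)}"))
        else:
            out.append(tok.string)
    return out
```

Two functions that differ only in names, comments and indentation now produce identical token streams, which a checksum or direct comparison can catch.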

What you'll have to do is produce a report that weighs all these factors and others and presents them to a human so the human can make a decision. Your tool should explain why it thinks two results are similar, not just that they are similar.

OTHER TIPS

Here is how I would approach this; custom enhancements can be added later:

Remove everything that is not a letter or number;

Use explode() with the space character as delimiter to find all the words; now you know how many words you have in that article;

Now you must find out how many times each word appears in the article, incrementing that word's counter each time it is found in the text;

Store this in an array, like:

$words['wordX']++;

Do this also with the second article that you want to check against;

Now compare them; you know the original data, so some conclusions can be made at this step;
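The counting-and-comparing steps above can be sketched as follows. The answer reaches for PHP's explode(), but the idea is language-agnostic, so this is a Python version; the cleaning regex and the overlap measure are my own illustrative choices.

```python
import re
from collections import Counter

def word_counts(article: str) -> Counter:
    """Strip everything that is not a letter or number, split on spaces, count words."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]", " ", article.lower())
    return Counter(cleaned.split())

def overlap(a: Counter, b: Counter) -> float:
    """Fraction of article A's words (with multiplicity) that also occur in B."""
    total = sum(a.values())
    if total == 0:
        return 0.0
    return sum(min(a[w], b[w]) for w in a) / total

original = "The feudal system bound peasants to the land."
suspect  = "The feudal system bound most peasants to the land, historians say."
score = overlap(word_counts(original), word_counts(suspect))
```

A score near 1.0 means nearly every word of the original reappears in the suspect article — a strong hint they are about the same thing, which is the "real step #1" described below.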

Using the capitalized words, like the J in John or the F in Feudalism, you can also draw some conclusions;

From here you may know whether the articles are about the same thing, and this could be the real step #1.

Now, somehow, you have to parse both articles word by word at the same time and see the differences between them.

A student can add an "original" sentence of their own after each sentence/paragraph found in the original article.

Make sure that if you advance too far in the parsing process on one of the articles, you keep the parsing balanced and try to parse the second article until you reach that balance.

I see two for loops, maybe three, or, instead of a third, a function that tries to keep the parsing process balanced.

Also, you have to use explode() and check sentence by sentence, and word by word within each sentence, to find the similarity.
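One way to sketch this sentence-by-sentence pass — again in Python, with made-up sentence splitting and an arbitrary threshold — is to match each suspect sentence against its best counterpart in the original rather than walking both articles in lockstep. That sidesteps the balancing problem: a student's inserted "original" sentences simply match nothing.

```python
import re

def sentences(text: str) -> list[list[str]]:
    """Split text on sentence punctuation, then each sentence into lowercase words."""
    parts = re.split(r"[.!?]+", text.lower())
    return [re.findall(r"\w+", p) for p in parts if p.strip()]

def similarity(a: list[str], b: list[str]) -> float:
    """Shared-word fraction between two sentences (0.0 .. 1.0)."""
    if not a or not b:
        return 0.0
    return len(set(a) & set(b)) / max(len(set(a)), len(set(b)))

def flag_copied(original: str, suspect: str, threshold: float = 0.6) -> list[str]:
    """Return suspect sentences whose best match in the original crosses the threshold."""
    orig = sentences(original)
    return [" ".join(sent) for sent in sentences(suspect)
            if any(similarity(sent, o) >= threshold for o in orig)]
```

Copied sentences get flagged even when the student interleaves filler sentences between them, which is the scenario the answer warns about.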

I am sure you get the idea, but I'll say it again: you can't parse the entire WWW.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow