Question

I'm currently writing a program to read a body of text and compare it to search-engine results (from searching for substrings of the given text), with the goal of detecting plagiarism in, for example, academic papers.

The two strings being compared are the original paper and the plaintext of the webpage (as returned by Floki.text/2 run on the <body> of the page). In both cases, all punctuation and formatting have been stripped out and replaced with spaces.

I'm not sure what sort of edit distance algorithm to use for this. I've looked through all the ones listed on Wikipedia, and...

  • Levenshtein distance (and Damerau-Levenshtein) seem like they would have trouble detecting, e.g., a few stolen sentences in the middle of an otherwise-distinct paper.
  • Longest Common Subsequence might be foiled by very slight rephrasings of low-meaning words ("a thing" vs. "the thing").
  • Hamming distance is completely incompatible, since the two texts probably won't be exactly the same length unless someone copied the entire thing.
  • Jaro and Jaro-Winkler are for short strings – the way they only look within a certain proximity of position just doesn't work when a sentence from one paper might be cut out and inserted at the beginning of the other.

Solution

This is somewhat of an XY answer, but given that you started with:

"read a body of text and compare it to search-engine results (from searching for substrings of the given text), with the goal of detecting plagiarism in, for example, academic papers."

it seems text search itself is a good, practical answer to your problem. The basic way of detecting plagiarism would be the following:

  1. Start with a corpus of documents that the target document could have been plagiarized from.
  2. Create, e.g., a Lucene-based inverted index over those documents (through, say, Solr or Elasticsearch).
  3. Split your target document into a set of phrases (e.g. by breaking off each sentence / sub-sentence / every n words).
  4. Search your corpus for each phrase. You will get back a (possibly empty) set of documents that the phrase could have been plagiarized from (and the location(s) in each document it was possibly taken from).
  5. Collect all of these potential instances of plagiarism. If they exceed a small threshold of phrases, flag the target as probably being plagiarized. (A sketch of steps 3–5 follows this list.)
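
Since you are working in Elixir anyway, here is a minimal sketch of steps 3–5 in that language. It assumes a hypothetical Elasticsearch index named "corpus" whose documents carry a "body" field, reached over HTTP with the Req client; the phrase length, overlap, slop, and threshold are all illustrative knobs, not recommendations.

```elixir
defmodule PlagiarismCheck do
  @es_url "http://localhost:9200/corpus/_search"  # hypothetical index
  @phrase_length 8

  # Step 3: split the target text into overlapping n-word phrases.
  def phrases(text, n \\ @phrase_length) do
    text
    |> String.downcase()
    |> String.split(~r/\s+/, trim: true)
    |> Enum.chunk_every(n, div(n, 2), :discard)  # 50% overlap between phrases
    |> Enum.map(&Enum.join(&1, " "))
  end

  # Step 4: search the corpus for one phrase. The "slop" parameter lets the
  # match tolerate small word-order differences and inserted words.
  def search_phrase(phrase) do
    query = %{query: %{match_phrase: %{body: %{query: phrase, slop: 2}}}}

    Req.post!(@es_url, json: query).body
    |> get_in(["hits", "hits"])
    |> Enum.map(&{&1["_id"], &1["_score"]})
  end

  # Step 5: flag the document if too many of its phrases match the corpus.
  def suspicious?(text, threshold \\ 0.05) do
    ps = phrases(text)
    hit_count = Enum.count(ps, fn p -> search_phrase(p) != [] end)
    hit_count / max(length(ps), 1) > threshold
  end
end
```

Keeping the per-phrase hits around (rather than just counting them) is what gives you the pinpointing described in advantage 1 below.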

This approach has several advantages over trying to diff strings:

  1. It allows you to pinpoint exactly what in the target document might have been plagiarized and where it could have come from. This gives the humans reviewing the output visibility into the evidence and lets them make informed decisions.
  2. A good indexing solution will buy you the ability to work around misspellings, different stop words, and tiny differences in phrasing (see the index-mapping sketch after this list).
  3. A good indexing solution will scale very well.
  4. A self-managed corpus will behave much better than searching the internet. The internet is such a wild and unruly place that you are likely to get spurious matches and miss important ones. That is, Google may catch students copying from Wikipedia, but it is also liable to falsely accuse people of copying from random blogs unless you are very, very careful. It is also liable to miss things like arXiv papers in the field, essays students can buy from shady websites, and past essays written by other students – all very realistic sources of plagiarism.
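
On point 2, here is a minimal sketch of what "working around tiny differences in phrasing" looks like in practice, assuming the same hypothetical "corpus" index and Req client as above. Elasticsearch's built-in "english" analyzer stems words and drops stop words at index time, so "a thing" and "the thing" index to the same terms – exactly the failure mode you worried about with Longest Common Subsequence.

```elixir
# Hypothetical index mapping: the built-in "english" analyzer stems words
# and removes stop words, so trivial rephrasings index identically.
index_settings = %{
  mappings: %{
    properties: %{
      body: %{type: "text", analyzer: "english"}
    }
  }
}

# Create the index (Req client assumed, as in the earlier sketch).
Req.put!("http://localhost:9200/corpus", json: index_settings)
```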

If you think about Turn-it-in, their approach must be similar to this, since they

  1. Tell you where the essay could have been plagiarized from
  2. Can include past papers and sources beyond Wikipedia & co.

The value that Turn-it-in and similar services can add over just setting up a system like this yourself (which honestly would not be too hard) is:

  1. Size and quality of their reference corpus
  2. Development time of their UI
  3. Tuning of their indexing and searching
  4. Sophistication in how they determine phrases and their thresholds for likely plagiarism.

OTHER TIPS

Your intent is to compare a body of text with search-engine results in order to detect plagiarism.

Unfortunately, the algorithms you are considering work at the character level. They are time-consuming on longer texts, and they are not well suited to detecting blocks of text or paragraphs that have been reordered.

Why not opt for a word-level approach instead: build a sorted list of the unique words in your text and in each search result, and measure their similarity (the proportion of common words). This can be very efficient. If the similarity exceeds a certain quota, you can then go for a more time-consuming comparison, either using the character-level algorithms or applying the same algorithms to strings of words instead of individual characters. (A sketch of this screening step follows.)
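
A minimal Elixir sketch of that screening step (matching the Floki/Elixir setting of the question): compute the Jaccard similarity of the two texts' word sets, and only escalate to an expensive comparison above a quota. The 0.5 quota is an arbitrary placeholder, not a tuned value.

```elixir
defmodule WordSimilarity do
  # Set of unique, lowercased words in a text.
  defp word_set(text) do
    text
    |> String.downcase()
    |> String.split(~r/\s+/, trim: true)
    |> MapSet.new()
  end

  # Proportion of common words: |A ∩ B| / |A ∪ B|.
  def jaccard(text_a, text_b) do
    a = word_set(text_a)
    b = word_set(text_b)
    common = MapSet.intersection(a, b) |> MapSet.size()
    total = MapSet.union(a, b) |> MapSet.size()
    if total == 0, do: 0.0, else: common / total
  end

  # Only escalate to a costly character-level (or word-level edit distance)
  # comparison when the cheap screening clears the quota.
  def worth_deep_comparison?(text_a, text_b, quota \\ 0.5) do
    jaccard(text_a, text_b) >= quota
  end
end
```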

Licensed under: CC-BY-SA with attribution