Question

I have a collection of 20,000 master articles, and I will receive about 400,000 articles of one or two pages every day. I need to determine whether each of these 400k articles is a copy or a modified version of an article in my master collection (a plagiarism threshold above 60% is fine with me). What algorithms and technologies should I use to tackle this problem efficiently and in a timely manner? Thanks.


Solution

Fingerprint the articles (i.e., hash them intelligently based on their word content), then look for statistical similarity between the fingerprints. Where the fingerprints suggest a likely match, run a brute-force string comparison on just those candidate pairs; see the sketch below.
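A minimal sketch of one way to realize this two-stage approach, assuming MinHash signatures over word shingles as the fingerprint and difflib for the final brute-force check (Python standard library only; the shingle size, signature length, and the 0.6 threshold are illustrative choices, not tuned values):

```python
import difflib
import hashlib

NUM_HASHES = 64     # signature length; more slots give a better similarity estimate
SHINGLE_SIZE = 3    # words per shingle

def shingles(text, k=SHINGLE_SIZE):
    """Yield overlapping k-word shingles from the text."""
    words = text.lower().split()
    for i in range(len(words) - k + 1):
        yield " ".join(words[i:i + k])

def minhash(text, num_hashes=NUM_HASHES):
    """Fingerprint: for each seeded hash function, keep the minimum hash over all shingles."""
    sig = [2 ** 64] * num_hashes
    for sh in shingles(text):
        for seed in range(num_hashes):
            h = int.from_bytes(
                hashlib.blake2b(f"{seed}:{sh}".encode(), digest_size=8).digest(),
                "big",
            )
            if h < sig[seed]:
                sig[seed] = h
    return sig

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def confirm_match(master_text, candidate_text, threshold=0.6):
    """Expensive character-level comparison, run only on fingerprint hits."""
    return difflib.SequenceMatcher(None, master_text, candidate_text).ratio() >= threshold
```

You would compute the 20,000 master signatures once, fingerprint each incoming article, and call `confirm_match` only for pairs whose estimated similarity clears the threshold. Note that at 400k articles a day, comparing every incoming signature against every master is still roughly 8 billion signature comparisons, so in practice you would likely add a locality-sensitive hashing step (banding the signatures into buckets) so each incoming article is compared only against masters sharing a bucket.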

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow