Question

I have a collection of 20,000 master articles, and every day I will receive about 400,000 articles of one or two pages each. I am trying to determine whether each of these 400k articles is a copy or a modified version of one of my master articles (a threshold of above 60% plagiarism is fine with me). What algorithms and technologies should I use to tackle this problem efficiently and in a timely manner? Thanks.

Solution

Fingerprint the articles (i.e. intelligently hash them based on their word frequencies), then look for statistical similarity between the fingerprints. If a fingerprint comparison suggests a likely match, confirm it with a brute-force search for matching strings on just those candidate pairs.
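Here is one way that idea could look in practice: a minimal, self-contained Python sketch that fingerprints documents with word shingles and MinHash, then uses locality-sensitive hashing (LSH) to index the 20k masters so each incoming article is only compared against a small candidate set. The class name `PlagiarismIndex`, the 5-word shingle size, and the 64-hash / 16-band parameters are illustrative assumptions, not anything prescribed above; the final brute-force string verification is left as a stub.

```python
import hashlib
import re
from collections import defaultdict

NUM_HASHES = 64           # MinHash signature length (assumed parameter)
BANDS = 16                # LSH bands; NUM_HASHES / BANDS = 4 rows per band
ROWS = NUM_HASHES // BANDS

def shingles(text, k=5):
    """Overlapping k-word shingles (word n-grams) of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def signature(text):
    """MinHash fingerprint: for each seeded hash function, keep the minimum
    hash over all shingles. Two signatures agree in roughly the same fraction
    of positions as the Jaccard similarity of the underlying shingle sets."""
    sh = shingles(text)
    return tuple(
        min(int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh)
        for seed in range(NUM_HASHES)
    )

class PlagiarismIndex:
    """LSH index over the master articles: each signature is cut into bands,
    and any document sharing an identical band becomes a candidate match."""

    def __init__(self):
        self.buckets = defaultdict(list)  # (band_no, band_hash) -> [master_id]
        self.signatures = {}              # master_id -> signature

    def add_master(self, master_id, text):
        sig = signature(text)
        self.signatures[master_id] = sig
        for b in range(BANDS):
            band = sig[b * ROWS:(b + 1) * ROWS]
            self.buckets[(b, hash(band))].append(master_id)

    def candidates(self, sig):
        found = set()
        for b in range(BANDS):
            band = sig[b * ROWS:(b + 1) * ROWS]
            found.update(self.buckets.get((b, hash(band)), ()))
        return found

    def check(self, text, threshold=0.6):
        """Return (master_id, estimated_similarity) pairs above the threshold.
        Survivors should still be verified with an exact string comparison
        (e.g. longest-common-substring search) before being flagged."""
        sig = signature(text)
        hits = []
        for mid in self.candidates(sig):
            msig = self.signatures[mid]
            sim = sum(a == b for a, b in zip(sig, msig)) / NUM_HASHES
            if sim >= threshold:
                hits.append((mid, sim))
        return hits

if __name__ == "__main__":
    index = PlagiarismIndex()
    index.add_master("master-1", "the quick brown fox jumps over the lazy dog " * 20)
    print(index.check("the quick brown fox jumps over the lazy dog " * 20))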

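The band/row split is the main tuning knob: more bands with fewer rows each catches lower-similarity pairs but produces more candidates to verify, so the counts above would need adjusting to your actual 60% target and throughput budget.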