Question

I have a collection of 20,000 master articles, and I will receive about 400,000 articles of one or two pages every day. I need to determine whether each of these 400k articles is a copy or a modified version of an article in my master collection (a plagiarism threshold above 60% is fine with me). What algorithms and technologies should I use to tackle this problem efficiently and in a timely manner? Thanks.


Solution

Fingerprint the articles (i.e., hash them intelligently based on their word content), then look for statistical similarity between the fingerprints. Where the fingerprints suggest a likely match, run a brute-force string comparison on just those candidate pairs; see the sketch below.
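A minimal sketch of one way to realize this two-stage approach, assuming MinHash signatures over word shingles as the fingerprint and difflib for the final brute-force check (Python standard library only; the shingle size, signature length, and the 0.6 threshold are illustrative choices, not tuned values):

```python
import difflib
import hashlib

NUM_HASHES = 64     # signature length; more slots give a better similarity estimate
SHINGLE_SIZE = 3    # words per shingle

def shingles(text, k=SHINGLE_SIZE):
    """Yield overlapping k-word shingles from the text."""
    words = text.lower().split()
    for i in range(len(words) - k + 1):
        yield " ".join(words[i:i + k])

def minhash(text, num_hashes=NUM_HASHES):
    """Fingerprint: for each seeded hash function, keep the minimum hash over all shingles."""
    sig = [2 ** 64] * num_hashes
    for sh in shingles(text):
        for seed in range(num_hashes):
            h = int.from_bytes(
                hashlib.blake2b(f"{seed}:{sh}".encode(), digest_size=8).digest(),
                "big",
            )
            if h < sig[seed]:
                sig[seed] = h
    return sig

def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def confirm_match(master_text, candidate_text, threshold=0.6):
    """Expensive character-level comparison, run only on fingerprint hits."""
    return difflib.SequenceMatcher(None, master_text, candidate_text).ratio() >= threshold
```

You would compute the 20,000 master signatures once, fingerprint each incoming article, and call `confirm_match` only for pairs whose estimated similarity clears the threshold. Note that at 400k articles a day, comparing every incoming signature against every master is still roughly 8 billion signature comparisons, so in practice you would likely add a locality-sensitive hashing step (banding the signatures into buckets) so each incoming article is compared only against masters sharing a bucket.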

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow