Question

I have a collection of 20,000 master articles, and I will receive about 400,000 articles of one or two pages every day. I am trying to determine whether each of these 400k articles is a copy or a modified version of one of my master articles (a threshold of above 60% plagiarism is fine with me). What algorithms and technologies should I use to tackle this problem efficiently and in a timely manner? Thanks.


Solution

Fingerprint the articles (i.e. intelligently hash them based on word frequency), then look for statistical similarity between the fingerprints. If a pair of fingerprints suggests a likely match, do a brute-force search for matching strings on just those documents.
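A minimal sketch of that pipeline, assuming plain-text articles. A SimHash over word frequencies is one common realization of the "intelligent hash" the answer describes (the answer does not name a specific scheme), and `difflib.SequenceMatcher` stands in for the brute-force string comparison; the `max_distance` and `threshold` values are illustrative, not prescribed.

```python
import difflib
import hashlib
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, tokenize on word characters, count occurrences."""
    return Counter(re.findall(r"\w+", text.lower()))

def simhash(freqs, bits=64):
    """64-bit SimHash: near-duplicate texts yield fingerprints that
    differ in only a few bit positions."""
    v = [0] * bits
    for word, count in freqs.items():
        h = int.from_bytes(hashlib.md5(word.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += count if (h >> i) & 1 else -count
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def likely_match(master_fp, candidate_fp, max_distance=10):
    """Cheap statistical screen: only fingerprint pairs within a small
    Hamming distance go on to the expensive string comparison."""
    return hamming(master_fp, candidate_fp) <= max_distance

def confirmed_plagiarism(master_text, candidate_text, threshold=0.60):
    """Expensive confirmation step: character-level similarity ratio,
    applying the question's 60% threshold."""
    return difflib.SequenceMatcher(None, master_text, candidate_text).ratio() >= threshold

# Usage: fingerprint the 20k masters once up front, then screen each
# incoming article against the fingerprint index before any slow comparison.
master = "the quick brown fox jumps over the lazy dog " * 20
candidate = master.replace("lazy", "sleepy")
if likely_match(simhash(word_frequencies(master)), simhash(word_frequencies(candidate))):
    if confirmed_plagiarism(master, candidate):
        print("flag as plagiarized")
```

The point of the two-stage design is cost: comparing 64-bit fingerprints is nearly free, so the 400k daily articles can be screened against all 20k masters quickly, and the slow string comparison runs only on the handful of pairs that survive the screen.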
