How to detect duplicate text with some fuzziness
04-07-2019
Question
Some time ago, I wrote a small script using Text::DeDupe to remove duplicate blog posts before I have to lay eyes on them.
After reading the Syntactic Clustering of the Web paper, on which the implementation is based, I would love to be able to find overlapping documents (e.g. snippets of blog posts as opposed to the full text, and perhaps also quotes).
Do you know of any other implementation in C, C++ or Perl that I can try out before writing my own?
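To make the "overlapping documents" case concrete: the paper defines two shingle-based measures, resemblance (how alike two documents are) and containment (how much of one document appears inside another). Containment is the one that catches a snippet quoted from a longer post. The following is only a minimal Python sketch of those two definitions. It computes them exactly on word shingles rather than on the sampled sketches the paper uses for scale, and the function names, the shingle width of 4 and the sample texts are all assumptions made for illustration.

```python
import re


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Assumption: a text shorter than w words becomes a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard coefficient of the two shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=4):
    """Fraction of a's shingles that also occur in b."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


# Hypothetical sample data: a full post and a snippet quoted from it.
full_post = "the quick brown fox jumps over the lazy dog near the quiet river bank"
snippet = "brown fox jumps over the lazy dog"

print("resemblance:", resemblance(snippet, full_post))
print("containment:", containment(snippet, full_post))
```

For a snippet taken verbatim from a longer post, resemblance stays low because the post has many shingles the snippet lacks, while containment of the snippet in the post is high. Plain deduplication thresholds on resemblance, which is why it misses this case.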
Solution
SpotSigs seems to fit the bill just right. Here are some references:
- http://dbpubs.stanford.edu/pub/2008-10
- http://infoblog.stanford.edu/2008/08/spotsigs-are-stopwords-finally-good-for.html
- http://ilpubs.stanford.edu:8090/860/
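As I understand the idea from the references above, SpotSigs builds signatures from short chains of content words that follow a stopword "antecedent", and compares documents by the Jaccard similarity of the resulting multisets. The Python sketch below is a heavily simplified illustration of that idea and not the SpotSigs implementation: the antecedent set, the stopword list, the chain length of 2, the fixed spot distance of 1 and all function names are assumptions, and the efficient matching machinery from the paper is left out entirely.

```python
import re
from collections import Counter

# Assumed, illustrative word lists; a real setup would tune both.
ANTECEDENTS = {"a", "the", "is", "do"}
STOPWORDS = ANTECEDENTS | {"at", "to", "off", "for", "of", "in", "and", "on"}


def spot_signatures(text, chain_len=2):
    """Return a multiset of 'antecedent:word:word' signatures for text."""
    tokens = re.findall(r"\w+", text.lower())
    sigs = Counter()
    for i, tok in enumerate(tokens):
        if tok not in ANTECEDENTS:
            continue
        # Chain = the next chain_len non-stopword tokens after the antecedent.
        chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_len]
        if len(chain) == chain_len:
            sigs[":".join([tok] + chain)] += 1
    return sigs


def multiset_jaccard(a, b):
    """Jaccard similarity of two Counters (min counts over max counts)."""
    inter = sum((a & b).values())
    union = sum((a | b).values())
    return inter / union if union else 0.0


# Hypothetical sample sentences.
doc1 = spot_signatures("The cat sat on a warm mat in the sun")
doc2 = spot_signatures("Yesterday the cat sat on a warm mat")
doc3 = spot_signatures("The dog sat on a warm mat")

print(sorted(doc1))
print(multiset_jaccard(doc1, doc2))
print(multiset_jaccard(doc1, doc3))
```

Because signatures are anchored on stopwords, which are dense in running prose and sparse in navigation bars and other page chrome, the comparison tends to focus on the article text itself. That is the property that makes the approach attractive for blog posts.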
The source code for this module is hosted on GitHub:
Licensed under: CC-BY-SA with attribution