A fast and accurate method for comparing similarity between text documents [closed]

https://stackoverflow.com/questions/17549926

performance
algorithm
similarity
tf-idf

02-06-2022

Question

I need to compare two groups of documents (one group might contain, say, 1000 documents) and determine which document in the second group is most similar to a given document in the first group. So far I have used TF/IDF with cosine similarity, but I need something faster that is still about as accurate as TF/IDF :) Can you suggest a faster algorithm, or a way to speed up TF/IDF?

Solution

It depends on what type of differences you are trying to match. The fastest approach I know of is to use shingle matching with MinHash: http://www.stanford.edu/~ashishg/amdm/handouts/scribed-lec10.pdf http://en.wikipedia.org/wiki/MinHash

Note that this technique is meant for finding exact or near duplicates, not documents that are only partially similar.
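
To make the idea concrete, here is a minimal sketch of shingle matching with MinHash in plain Python. It is only an illustration under some assumptions of mine, not a tuned implementation: the function names (`shingles`, `make_hash_params`, `minhash_signature`, `estimate_jaccard`, `most_similar`), the choice of 3-word shingles, and the 128 hash functions are all hypothetical choices that do not come from the linked material. Each document is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib
import random

# Large Mersenne prime used as the modulus for the hash family (an assumption).
_PRIME = (1 << 61) - 1


def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _stable_hash(shingle):
    """Deterministic 64-bit hash of a shingle (built-in hash() is salted per run)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(num_hashes=128, seed=42):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod _PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash value over all shingles."""
    if not shingle_set:
        # Sentinel signature for empty documents.
        return [_PRIME] * len(params)
    base = [_stable_hash(s) % _PRIME for s in shingle_set]
    return [min((a * x + b) % _PRIME for x in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def most_similar(query, candidates, params, k=3):
    """Return (index, estimated similarity) of the candidate closest to query."""
    query_sig = minhash_signature(shingles(query, k), params)
    scores = [estimate_jaccard(query_sig, minhash_signature(shingles(c, k), params))
              for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

In practice the signatures of the second group would be computed once and stored, so each query costs one signature plus cheap comparisons. For larger collections, the signatures are typically bucketed with locality-sensitive hashing so that only likely matches are compared at all.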

Licensed under: CC-BY-SA with attribution
