Question

I need to compare two groups of documents (e.g. one group might have 1000 documents) and determine which document in the second group is most similar to a given document in the first group. So far I have used TF/IDF and cosine similarity, but I need something faster with accuracy comparable to TF/IDF :) Can you suggest a faster algorithm, or a way to improve the running time of TF/IDF?


Solution

It depends on what type of differences you are trying to match. The fastest approach I know of is to use shingle matching with MinHash: http://www.stanford.edu/~ashishg/amdm/handouts/scribed-lec10.pdf and http://en.wikipedia.org/wiki/MinHash

Note that this approach is meant for finding near or exact duplicates, not partially similar documents.
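
To make the shingle plus MinHash idea more concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions of mine, not a tuned implementation: the function names (`shingles`, `make_hash_params`, `minhash_signature`, `estimated_jaccard`, `most_similar`) are hypothetical, and the choices of 5-character shingles, 128 hash functions, and CRC32 as the base hash are arbitrary defaults you would want to adjust for your data.

```python
import random
import zlib

# Large Mersenne prime used as the modulus of the (a*x + b) % p hash family.
PRIME = (1 << 61) - 1


def shingles(text, k=5):
    """Return the set of k-character shingles of a lowercased,
    whitespace-normalised text (k=5 is an assumed default)."""
    text = " ".join(text.lower().split())
    if len(text) <= k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def make_hash_params(num_hashes=128, seed=42):
    """Create (a, b) coefficient pairs, one per simulated permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash over all shingles."""
    base = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in base) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the
    Jaccard similarity of the two underlying shingle sets."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def most_similar(query, candidates, params, k=5):
    """Return (index, estimated similarity) of the candidate whose
    signature agrees most with the query's signature."""
    query_sig = minhash_signature(shingles(query, k), params)
    scores = [
        estimated_jaccard(query_sig, minhash_signature(shingles(c, k), params))
        for c in candidates
    ]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]
```

In practice you would compute the signatures of the second group once and reuse them for every document of the first group, so each comparison costs only one pass over two short integer lists. This sketch still compares against every candidate; the locality-sensitive hashing scheme described in the linked MinHash material is the usual way to avoid that on larger collections.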

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow