Question

I need to compare two groups of documents (e.g. one group might have 1000 documents) and determine which document in the second group is most similar to a given document in the first group. So far I have used TF/IDF and cosine similarity, but I need something faster that is still about as accurate as TF/IDF :) Can you suggest a faster algorithm, or a way to speed up TF/IDF?

Solution

It depends on what type of differences you are trying to match. The fastest approach I know of is to use shingle matching with MinHash: http://www.stanford.edu/~ashishg/amdm/handouts/scribed-lec10.pdf http://en.wikipedia.org/wiki/MinHash

Note that this technique is used to find exact or near duplicates, not partially similar documents.
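
To make the idea concrete, here is a minimal sketch of shingling plus MinHash using only the Python standard library. It is an illustration under assumptions, not a tuned implementation: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `most_similar`), the 3-word shingle size, and the 128 hash functions are all choices made for this example. Each document is reduced to a short signature, and the fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib
import random

# A Mersenne prime used as the modulus for the random linear hash functions.
PRIME = (1 << 61) - 1


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _stable_hash(shingle):
    """64-bit hash that is stable across runs (the built-in hash() is not)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(num_hashes=128, seed=0):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash value over all shingles."""
    hashed = [_stable_hash(s) for s in shingle_set]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions; estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def most_similar(query_sig, candidate_sigs):
    """Index of the candidate signature closest to the query signature."""
    return max(range(len(candidate_sigs)),
               key=lambda i: estimated_jaccard(query_sig, candidate_sigs[i]))
```

In this sketch you would compute the signatures of the second group once with the same `params`, then call `most_similar` for each document of the first group. Comparing two signatures costs a fixed number of integer comparisons regardless of document length, which is where the speed-up over comparing full TF/IDF vectors would come from.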

License: CC-BY-SA with attribution
Not affiliated with StackOverflow