In most locality sensitive hashing implemensions of SimHash, why is the cosine distance used and not the euclidean distance?

cs.stackexchange https://cs.stackexchange.com/questions/98631

質問

In Chapter 3 of Mining of Massive Datasets, the basis of locality sensitive hashing is explained. They notably mention simhash for the cosine distance, where random hyperplanes are generated, and for each hyperplane, the projection of the vector to be hashed onto the hyperplane's normal is used for hashing the vector. They highlight that to instead measure euclidean distance, one can involve the use of a value $a$ as a segment length, used to split all hyperplane normals into some number of $a$-length segments. For each hyperplane, the segment into which the vector's projection falls into is used as its hash output. Hence the concatenation of this operation on each hyperplane generates a hash.

Yet a number of implementations, including what seems to be one of the most authoritative (falconn), do not use segments at all, and instead simply do a binary output depending on which side of the hyperplane the projection falls into. Why is this ? Why are segments not used ? What does the cosine distance have over the euclidean distance ?

正しい解決策はありません

ライセンス: CC-BY-SA帰属
所属していません cs.stackexchange
scroll top