Indexing for similarity search

https://stackoverflow.com/questions/16844591

30-05-2022
|

Question

I have about 100M numeric vectors (Minhash fingerprints), each vector contains 100 integer numbers between 0 and 65536, and I'm trying to do a fast similarity search against this database of fingerprints using Jaccard similarity, i.e. given a query vector (e.g. [1,0,30, 9, 42, ...]) find the ratio of intersection/union of this query set against the database of 100M sets.

The requirement is to return k "nearest neighbors" of the query vector in <1 sec (not including indexing/File IO time) on a laptop. So obviously some kind of indexing is required, and the question is what would be the most efficient way to approach this.

notes: I thought of using SimHash but in this case actually need to know the size of intersection of the sets to identify containment rather than pure similarity/resemblance, but Simhash would lose that information.

I've tried using a simple locality sensitive hashing technique as described in ch3 of Jeffrey Ullman's book by dividing each vector into 20 "bands" or snippets of length 5, converting these snippets into strings (e.g. [1, 2, 45, 2, 3] - > "124523") and using these strings as keys in a hash table, where each key contains "candidate neighbors". But the problem is that it creates too many candidates for some of these snippets and changing number of bands doesn't help.

La solution 2

One way to go about this is the following:

(1) Arrange the vectors into a tree (a radix tree).

(2) Query the tree with a fuzzy criteria, in other words, a match is if the difference in values at each node of the tree is within a threshold

(3) From (2) generate a subtree that contains all the matching vectors

(4) Now, repeat process (2) on the sub tree with a smaller threshold

Continue until the subtree has K items. If K has too few items, then take the previous tree and calculate the Jacard distance on each member of the subtree and sort to eliminate the worst matches until you have only K items left.

Autres conseils

I might be a bit late, but I would suggest IVFADC indexing by Jegou et al.: Product Quantization for Nearest Neighbor Search

It works for L2 Distance/dot product similarity measures and is a bit complex, but it's particularly efficient in terms of both time and memory.

It is also implemented in the FAISS library for similarity search, so you could also take a look at that.

answering my own question after 6 years, there is a benchmark for approximate nearest neighbor search with many algorithms to solve this problem: https://github.com/erikbern/ann-benchmarks, the current winner is "Hierarchical Navigable Small World graphs": https://github.com/nmslib/hnswlib

You can use off-the-shelf similarity search services such as AWS-ES or Pinecone.io.

Licencié sous: CC-BY-SA avec attribution

Non affilié à StackOverflow