Question

I have a dataset with about 1 000 000 texts for which I have computed sentence embeddings with a language model and stored them in a numpy array.

I wish to compare a new unseen text to all the 1 000 000 pre-computed embeddings using cosine similarity, and retrieve the most semantically similar document in the corpus.

What is the most efficient way to perform this 1-vs-all comparison?

I would be thankful for any pointers and feedback!


Solution

There are libraries specialized in exactly this task, for instance FAISS by Facebook AI Research:

Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Faiss is written in C++ with complete wrappers for Python/numpy. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research.

For cosine similarity, you can use the FAISS class IndexFlatIP, after L2-normalizing the vectors first, as specified in the FAISS documentation.
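
A minimal sketch of that approach (the array names and embedding dimension are placeholders; FAISS expects float32 arrays):

import faiss
import numpy as np

d = 768                                                   # embedding dimension (placeholder)
corpus = np.random.rand(1_000_000, d).astype("float32")   # stand-in for your precomputed embeddings
query = np.random.rand(1, d).astype("float32")            # stand-in for the new text's embedding

# On L2-normalized vectors, the inner product equals cosine similarity.
faiss.normalize_L2(corpus)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)   # exact (brute-force) inner-product search
index.add(corpus)

# scores are cosine similarities, ids are row indices into the corpus array
scores, ids = index.search(query, 5)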

If you want to know more about the techniques behind FAISS, you can have a look at their research paper, and if you want to know more about the fundamentals of similarity search in general, there is this blog post by Flickr.

Some alternatives to FAISS are Annoy, NMSLib and Yahoo's NGT. You can find a couple of comparisons of these and other libraries here and here.

Other tips

@lsbister,

You could create a pandas dataframe and use a dask function/lambda function to parallelize the one-vs-all computation.

If you use dask, you can create partitions and map the results back. If you use pandas, you can use the apply function and parallelize the computations to a certain extent.
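
A minimal sketch of the dask.array variant (the file names, chunk size, and embedding shapes are assumptions; for a single query, a plain vectorized numpy dot product over normalized embeddings may already be fast enough):

import numpy as np
import dask.array as da

embeddings = np.load("embeddings.npy")    # shape (1_000_000, d), hypothetical file with the corpus embeddings
query = np.load("query.npy")              # shape (d,), hypothetical file with the new text's embedding

# Normalize once so that a plain dot product equals cosine similarity.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Split the corpus into chunks so the per-chunk dot products run in parallel.
corpus = da.from_array(embeddings, chunks=(100_000, embeddings.shape[1]))
similarities = corpus.dot(query).compute()   # one cosine score per document
best_idx = int(np.argmax(similarities))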

I would also suggest giving gensim a shot. It's pretty quick compared to self-written 'top-n retrieval by cosine similarity' functions.
Save your embeddings to a text file and then load it with 'load_word2vec_format'. Once you load the model you can use the function 'similar_by_word' or 'similar_by_vector' to retrieve the n closest vectors.

from gensim.models import KeyedVectors

# Load the embeddings saved in word2vec text format (binary=False).
model = KeyedVectors.load_word2vec_format("model_file", binary=False)

# Retrieve the 10 vectors most similar (by cosine similarity) to the vector of 'cat'.
top_10 = model.similar_by_word('cat')

Just pay attention, when you save the file, to include as the first line the number of embeddings (1M in your case) and the vector dimension. It's just the standard word2vec text format that gensim reads; the file should look like this:

1000000 300
. -0.0001 -0.0001 ...
, -0.0001 -0.0001 ...
a -0.0001 -0.0001 ...
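
If you need to produce such a file from the numpy array, here is a minimal sketch (the embeddings.npy path and the doc_<i> labels are made up; each label must be a single token without spaces):

import numpy as np

embeddings = np.load("embeddings.npy")                   # shape (1_000_000, d), hypothetical file
labels = [f"doc_{i}" for i in range(len(embeddings))]    # one token per document

with open("model_file", "w", encoding="utf-8") as f:
    # Header line: number of vectors and their dimensionality, separated by a space.
    f.write(f"{embeddings.shape[0]} {embeddings.shape[1]}\n")
    for label, vec in zip(labels, embeddings):
        f.write(label + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")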
Licensed under: CC-BY-SA with attribution