Term document matrix and cosine similarity in Python

https://stackoverflow.com/questions/18113379

23-06-2022

Question

I have the following situation that I want to address using Python (preferably with numpy and scipy):

I have a collection of documents that I want to convert to a sparse term-document matrix.
I want to extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents using cosine similarity within a certain subset of documents (documents are labelled with categories, and I want to find similar documents within the same category).

How do I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take the dot product to find cosine similarity, but how do I convert the entire corpus to a large but sparse term-document matrix (so that I can also extract its rows as scipy.sparse.coo_matrix row vectors)?

Thanks.

Solution

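May I recommend you take a look at scikit-learn? It is a very well regarded library in the Python community, with a simple and consistent API, and it also implements a cosine similarity metric. Because TfidfVectorizer L2-normalizes each row by default, multiplying the TF-IDF matrix by its transpose gives the pairwise cosine similarities directly. This is an example, taken from a linked source, of how you could do it in 3 lines of code: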

>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> # Rows are L2-normalized, so this matrix product is the pairwise cosine similarity.
>>> (tfidf @ tfidf.T).toarray()
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
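To restrict the search to documents in the same category, one option is to slice the rows of the sparse matrix before taking the product. The following is only a minimal sketch: the `categories` labels and the `top_similar` helper are made up for illustration and are not part of scikit-learn or of the original example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
# Hypothetical category labels, one per document (an assumption for this sketch).
categories = np.array(["fruit", "fruit", "fruit", "software"])

vect = TfidfVectorizer()
tfidf = vect.fit_transform(docs)  # sparse matrix, rows L2-normalized by default


def top_similar(doc_index, k=10):
    """Return up to k (index, score) pairs from the same category as doc_index."""
    # Indices of the other documents that share the query document's category.
    same = np.flatnonzero(categories == categories[doc_index])
    same = same[same != doc_index]
    if same.size == 0:
        return []
    query = tfidf[[doc_index]]  # 1 x n_terms sparse row
    # Dot products of normalized rows are cosine similarities.
    scores = (tfidf[same] @ query.T).toarray().ravel()
    order = np.argsort(-scores)[:k]
    return [(int(same[i]), float(scores[i])) for i in order]


result = top_similar(0, k=2)
print(result)
```

With real data you would replace `categories` with your own labels. For a large corpus, slicing the rows per category once and reusing the slices should be cheaper than filtering on every query.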

Other tips

You can refer to this question:

Python: tf-idf-cosine: to find document similarity

I answered that question and showed how to compute cosine similarity with the scikit-learn package.
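As a rough sketch of what that can look like (not necessarily the approach from the linked answer), scikit-learn's `cosine_similarity` accepts sparse input and normalizes the rows itself, so it also works on a raw term-document count matrix built with `CountVectorizer`. The sample documents are reused from the answer above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]

# Sparse document-term count matrix: one row per document, one column per term.
counts = CountVectorizer().fit_transform(docs)

# All pairwise similarities, returned as a dense n_docs x n_docs array.
sims = cosine_similarity(counts)

# Similarities of a single document (row 0) against the whole corpus.
first_row = counts[[0]]  # 1 x n_terms sparse row
sims_to_first = cosine_similarity(first_row, counts).ravel()

print(np.round(sims, 3))
```

Note that the dense result grows quadratically with the number of documents, so for a large corpus it is probably better to compare one row (or one category slice) at a time, as in the second call.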

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow