Python: Cosine Similarity m * n matrices

https://stackoverflow.com/questions/11405673

19-06-2021
Question

I have two M x N matrices that I construct after extracting data from images. Both have a long first row, and after the 3rd row each row contains only a first column. For example, a raw vector looks like this:

1,23,2,5,6,2,2,6,2,
12,4,5,5,
1,2,4,
1,
2,
2
:

Both vectors follow a similar pattern: the first three rows are long, and the rows then thin out as they progress. To compute cosine similarity, I was thinking of using a padding technique that adds zeros to make these two vectors N x N. I looked at Python options for cosine similarity, but some examples used a package called numpy. I couldn't figure out how exactly numpy can do this type of padding and carry out a cosine similarity. Any guidance would be greatly appreciated.

Solution

If both arrays have the same dimensions, I would flatten them using NumPy. NumPy and SciPy are powerful scientific computing tools that make matrix manipulation much easier.

Here is an example of how I would do it with NumPy and SciPy:

import numpy as np
from scipy.spatial import distance

A = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )
B = np.array([[1,23,2,5,6,2,2,6,2],[12,4,5,5],[1,2,4],[1],[2],[2]], dtype=object )

Aflat = np.hstack(A)
Bflat = np.hstack(B)

dist = distance.cosine(Aflat, Bflat)

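The result here is dist = 1.10e-16 (i.e., 0). Note that distance.cosine() returns the cosine distance, so the cosine similarity is 1 - dist, which here is 1 because the two arrays are identical.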

Note that I've used dtype=object here because that's the only way I know of to store rows of different lengths in a NumPy array. That's also why I later used hstack() to flatten the array (instead of the more common flatten() function).
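
If the two matrices do not have identical row lengths, the zero-padding idea from the question could look something like the sketch below. The helper name pad_rows and the second matrix B_rows are made up for illustration, and the sketch assumes both matrices have the same number of rows and that padding on the right is what is wanted.

```python
import numpy as np
from scipy.spatial import distance


def pad_rows(rows, width):
    """Right-pad each row with zeros to `width`; returns a 2-D float array.

    Hypothetical helper, not part of NumPy.
    """
    out = np.zeros((len(rows), width), dtype=float)
    for i, row in enumerate(rows):
        out[i, :len(row)] = row
    return out


A_rows = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]
# Made-up second matrix with slightly different row lengths.
B_rows = [[1, 20, 2, 5, 6, 2, 2], [12, 4, 5, 5, 3], [1, 2, 4], [1], [2], [2]]

# Pad both matrices to the widest row found in either of them.
width = max(len(r) for r in A_rows + B_rows)
A_pad = pad_rows(A_rows, width)
B_pad = pad_rows(B_rows, width)

# Flatten the rectangular arrays into plain vectors.
a = A_pad.ravel()
b = B_pad.ravel()

# Cosine similarity from its definition, and SciPy's cosine distance.
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dist = distance.cosine(a, b)  # equals 1 - similarity

print(similarity, dist)
```

The padded zeros contribute nothing to the dot product or to the norms, so they only serve to line up the positions of the two matrices.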

OTHER TIPS

I would convert them to a SciPy sparse matrix (http://docs.scipy.org/doc/scipy/reference/sparse.html) and then run cosine similarity from the scikit-learn module.

from scipy import sparse
from sklearn.metrics import pairwise_distances

# your_np_array must be a regular 2-D array
sparse_matrix = sparse.csr_matrix(your_np_array)

# cosine distance between every pair of rows in sparse_matrix
distance_matrix = pairwise_distances(sparse_matrix, metric="cosine")
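
Since pairwise_distances() compares the rows of one matrix with each other, comparing two whole jagged matrices needs a slightly different setup. The following is only a sketch using SciPy alone: jagged_to_csr and matrix_cosine are hypothetical helper names, and it assumes each row's values occupy the leading columns, which is equivalent to padding with zeros on the right.

```python
import numpy as np
from scipy import sparse


def jagged_to_csr(rows, n_cols):
    """Build a CSR matrix from jagged rows (hypothetical helper).

    Assumes row i holds the values of columns 0..len(rows[i])-1.
    """
    data = np.concatenate([np.asarray(r, dtype=float) for r in rows])
    indices = np.concatenate([np.arange(len(r)) for r in rows])
    indptr = np.concatenate([[0], np.cumsum([len(r) for r in rows])])
    return sparse.csr_matrix((data, indices, indptr), shape=(len(rows), n_cols))


def matrix_cosine(A, B):
    """Cosine similarity of two sparse matrices treated as flat vectors."""
    dot = A.multiply(B).sum()
    norm_a = np.sqrt(A.multiply(A).sum())
    norm_b = np.sqrt(B.multiply(B).sum())
    return dot / (norm_a * norm_b)


rows = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]
A = jagged_to_csr(rows, n_cols=9)
B = jagged_to_csr(rows, n_cols=9)

print(matrix_cosine(A, B))
```

Only the values that are actually present get stored, so no explicit zero padding is needed.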

Why can't you just run a nested loop over both jagged lists (presumably), summing the Euclidean/vector dot product of each pair of rows and using the result as a similarity measure? This assumes that the jagged dimensions are identical.

That said, I'm not quite sure how you are getting a jagged array from a bitmap image (I would have assumed it would be a proper dense matrix of M x N form), or how the jagged array of arrays above is meant to represent an M x N matrix of image data, and therefore how padding the data with zeros would make sense. If this were a sparse matrix representation, one would expect row/column information annotated with the values.
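
The nested-loop idea from this answer could be sketched in plain Python roughly as follows. The function name jagged_cosine is made up, and the sketch goes one step beyond the raw dot product by dividing by the two vector lengths, so that the result is a cosine similarity. As stated above, it assumes both jagged lists have identical dimensions.

```python
import math


def jagged_cosine(a_rows, b_rows):
    """Cosine similarity of two jagged lists with identical row lengths."""
    if len(a_rows) != len(b_rows):
        raise ValueError("both inputs must have the same number of rows")

    dot = norm_a = norm_b = 0.0
    for row_a, row_b in zip(a_rows, b_rows):
        if len(row_a) != len(row_b):
            raise ValueError("corresponding rows must have the same length")
        for x, y in zip(row_a, row_b):
            dot += x * y       # running dot product
            norm_a += x * x    # running squared length of a
            norm_b += y * y    # running squared length of b

    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero input")
    return dot / math.sqrt(norm_a * norm_b)


A_rows = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]
B_rows = [[1, 23, 2, 5, 6, 2, 2, 6, 2], [12, 4, 5, 5], [1, 2, 4], [1], [2], [2]]

print(jagged_cosine(A_rows, B_rows))
```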

Licensed under: CC-BY-SA with attribution