Pairwise cdist in scipy instead of zip

https://stackoverflow.com/questions/22258923

11-06-2023

Question

I want to compute cdist between each group in a list of groups of vectors and the centroid of that group.

In other words, I want to do the equivalent of [cdist(px, cent) ** 2 for px, cent in zip(pixelwise, centroids)] (izip on Python 2).

So why not just do that? Because it's the slowest part of my program. I want to see if there's a way of doing it natively in numpy/scipy that's faster than a list comprehension + zip in Python.

Example code:

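import numpy
from scipy.spatial.distance import cdist

# allframes, weight and variances are defined elsewhere in the program
pixelwise = allframes.transpose((1, 0, 2))
centroids = pixelwise.mean((0,)).reshape((pixelwise.shape[0], 1, 3))
variances += weight * numpy.sum(
   [cdist(px, cent, 'euclidean') ** 2 for px, cent in zip(pixelwise, centroids)])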

Example values of pixelwise:

array([[[1, 1, 2],
        [2, 3, 4],
        [2, 2, 2]],

       [[1, 2, 3],
        [2, 3, 4],
        [2, 2, 2]],

       [[1, 2, 3],
        [2, 1, 1],
        [2, 2, 2]],

       [[4, 3, 2],
        [2, 3, 2],
        [2, 2, 2]]])

And the centroids you get for that value of pixelwise:

array([[[ 1.75,  2.  ,  2.5 ]],

       [[ 2.  ,  2.5 ,  2.75]],

       [[ 2.  ,  2.  ,  2.  ]]])

Solution

If I understand your intent correctly, you are trying to estimate how far apart each "group" of vectors is from the centroids of the other groups. If that is the case, it looks like you are missing a normalization factor for the number of vectors in the group. Nevertheless, you can get a good estimate of this distance by simply considering

scipy.spatial.distance.pdist(centroids, 'euclidean')

i.e. the distances between the centroids themselves. This is a first-order approximation. If you feed this data into an algorithm it may be good enough, in that it can find the sets of vectors that are the most separated.
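A minimal sketch of that approximation, using the example centroids from the question. pdist expects a 2-D array of observations, so the sketch assumes the (3, 1, 3) centroid array is first reshaped to (3, 3). The names flat_centroids, condensed and dist_matrix are illustrative, and squareform is only there to show the result as a square matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Example centroids from the question, shape (3, 1, 3)
centroids = np.array([[[1.75, 2.0, 2.5]],
                      [[2.0, 2.5, 2.75]],
                      [[2.0, 2.0, 2.0]]])

# pdist needs a 2-D (n_points, n_dims) array, so drop the singleton axis
flat_centroids = centroids.reshape(-1, centroids.shape[-1])

# Condensed vector of distances for the pairs (0, 1), (0, 2), (1, 2)
condensed = pdist(flat_centroids, 'euclidean')

# Square form: dist_matrix[i, j] is the distance between centroids i and j
dist_matrix = squareform(condensed)
```

The largest entries of dist_matrix then point at the pairs of groups whose centroids lie furthest apart.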

As the comments indicate, the functionality you were originally looking for is not built into scipy, so you'll have to do each summation independently. However, the problem is embarrassingly parallel, so it might help to use multiprocessing.
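For the specific case in the question, where every group is compared with a single centroid, each of those summations is just a sum of squared differences, which NumPy broadcasting can compute without a Python loop. The following is a sketch under that assumption, not a general replacement for cdist. The random data, the (groups, vectors per group, 3) layout and the choice of taking each centroid over its own group's vectors (axis 1) are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
# Hypothetical data: 5 groups of 4 three-component vectors
pixelwise = rng.random((5, 4, 3))
# One centroid per group, kept as shape (5, 1, 3) so it broadcasts against its group
centroids = pixelwise.mean(axis=1, keepdims=True)

# Loop version: squared distance from each vector to its own group's centroid
loop_total = np.sum([cdist(px, cent, 'euclidean') ** 2
                     for px, cent in zip(pixelwise, centroids)])

# Broadcast version: subtract each centroid from its group, square, and sum
diff = pixelwise - centroids
per_group = (diff ** 2).sum(axis=(1, 2))  # one sum of squared distances per group
vectorized_total = per_group.sum()

# Both approaches should agree up to floating-point rounding
assert np.isclose(loop_total, vectorized_total)
```

Whether this is actually faster depends on the array sizes, so it is worth timing both versions on real data.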

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow