Question

In a data clustering problem I have two numpy arrays, X and C, where X corresponds to observations and C corresponds to the centers of the clusters that can be formed with the data in X. Both of them have the same number of columns (features) but C usually has far fewer rows than X. I'm trying to find a fast way of calculating the minimum squared distance between each observation in X and all the centers in C. In plain Python this can be written as

D2 = np.array([min([np.inner(c-x,c-x) for c in C]) for x in X])

which is rather slow, so I managed to improve the speed by doing

D2 = np.array([min(np.sum((C-x)**2, axis=1)) for x in X])

instead, but I'm not yet satisfied with the execution time, and since a for loop still remains, I believe there is hope. Does anyone have an idea on how to further reduce the execution time of this?

For the curious, I'm using this to generate seeds for K-Means through the K-Means++ algorithm.


Solution

The fastest you'll get with the numpy/scipy stack is a specialized function built for exactly this purpose: scipy.spatial.distance.cdist.

scipy.spatial.distance.cdist(XA, XB, metric='euclidean', p=2, ...)

Computes distance between each pair of the two collections of inputs.
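As a sketch of how this could be applied to the case above (assuming X and C are the arrays from the question, and using the 'sqeuclidean' metric so the result matches the squared distances computed there):

import numpy as np
from scipy.spatial.distance import cdist

# Illustrative shapes: X holds n observations, C holds k cluster centers,
# both with the same number of features d.
X = np.random.rand(1000, 5)
C = np.random.rand(8, 5)

# Pairwise squared Euclidean distances, shape (n, k); the row-wise
# minimum then gives the distance from each observation to its
# nearest center, i.e. the D2 array from the question.
D2 = cdist(X, C, metric='sqeuclidean').min(axis=1)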

It's also worth noting that scipy provides kmeans clustering as well.

scipy.cluster.vq.kmeans
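A minimal usage sketch, assuming the same X as above (scipy's vq module expects features to be whitened to unit variance first, and the cluster count of 8 is purely illustrative):

from scipy.cluster.vq import kmeans, whiten

# Normalize each feature to unit variance, then run k-means.
features = whiten(X)
centroids, distortion = kmeans(features, 8)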

Licensed under: CC-BY-SA with attribution