Question

So I'm thinking of using different distance metrics for k-means like Euclidean distance, Manhattan distance, cosine distance, Chebyshev distance among others. I just want to know the use cases of these distance metrics, associated with clustering.

Was it helpful?

Solution

The short answer is:

You should not.

K-means actually is not distance based.

It is based on variance minimization. And the variance is minimized by assigning each object to the one by closes squared euclidean distance (because squared euclidean essentially is the same as variance!). And since the sqrt function is monotone, you can also think of it as assinging by closest Euclidean distance.

Now if you plug in an arbitrary other distance function, it will no longer minimize variance, and k-means may stop converging.

Note that the other step in k-means is updating the means. Again to minimize variance, it is optimal to move the cluster center to the mean. If you plug in another distance function, this may no longer hold. Boom.

However, there are exception. Obviously, for some distance function the mean works well, too. So it will in fact also converge.

Furthermore, there exist variants such as K-medoids. This one actually is designed to minimize distances, and will work with arbitrary distances. It does not need the mean: instead it uses the most central object from your data set. This gives you back convergence for arbitrary distances!

Update: Here is an example how other distance measures may fail.

Assume we are using absolute Pearson Correlation for measuring similarity. The following two series are perfectly negative correlated, i.e. have distance 0 in absolute pearson:

+1 +2 +3 +4 +5
-1 -2 -3 -4 -5

If we compute the mean of these instances, the mean is 0 0 0 0 0, and Pearson similarity is A) no longer well defined, because the standard deviation has become 0; and even if we would fix this definition gap, the mean would be the most dissimilar vector possible with respect to this measure.

Therefore, only use k-means with other distances, when you can prove that the mean function minimizes your distances. As an exercise, you may want to prove that the mean does minimize squared Euclidean distances.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top