Question

Imagine you have $k$ classes. Every class $i$ has points that follow a probability distribution such that their distance to the origin is $i$ in expectation, with this distance normally distributed and the direction uniformly distributed. So all classes form concentric shells around the origin $0$.

Can $k$-means recover these shells when you choose the "right" distance metric? (Obviously it can't find them if you take the Euclidean metric, but I wonder whether there is any metric at all, or whether this problem is inherently unsolvable by $k$-means, even if you know the number of clusters $k$.)
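For concreteness, here is a small sketch of the data-generating process I have in mind (my own illustration; the class count, sample size, dimension, and radius standard deviation are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_shells(k=3, n_per_class=200, radius_std=0.1, dim=2):
    """Class i (i = 1..k): uniform direction, radius ~ Normal(i, radius_std)."""
    X, y = [], []
    for i in range(1, k + 1):
        # Uniform direction: normalize standard Gaussian vectors.
        d = rng.normal(size=(n_per_class, dim))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        # Radius normally distributed around the class index i.
        r = rng.normal(loc=i, scale=radius_std, size=(n_per_class, 1))
        X.append(d * r)
        y.append(np.full(n_per_class, i))
    return np.vstack(X), np.concatenate(y)

X, y = make_shells()
```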


Solution

You cannot just use arbitrary distance functions with k-means, because the algorithm is not based on metric properties but on variance.

https://stats.stackexchange.com/q/81481/7828

The fact is that k-means minimizes the sum of *squared* distances. This does not give you the "smallest distances" but only the smallest squared distances, and the two are not the same (compare the difference between the median and the mean). If you want to minimize Euclidean distances, use k-medians; if you want other distances, use PAM (k-medoids).
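A quick numeric illustration of that distinction (my own sketch, not part of the original answer; the sample values are arbitrary): the mean minimizes the summed squared distances, while the median minimizes the summed absolute distances.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])  # deliberately skewed sample
c = np.linspace(-5, 15, 2001)             # candidate centers, step 0.01

# Sum of squared distances and sum of absolute distances to each candidate.
sq_loss = ((x[None, :] - c[:, None]) ** 2).sum(axis=1)
abs_loss = np.abs(x[None, :] - c[:, None]).sum(axis=1)

print(c[sq_loss.argmin()], np.mean(x))     # both 3.2 -- the mean
print(c[abs_loss.argmin()], np.median(x))  # both 2.0 -- the median
```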

You can generalize k-means to a somewhat larger family of distances known as Bregman divergences, and you can do a variant of the kernel trick. But that is not very powerful here, because you have no labels for optimizing the kernel parameters! Still, that could be what this exercise question is getting at: if your shells are indeed centered at the origin, then you can transform your data (read: an explicit, hand-rolled version of the kernel trick) to angle plus distance from the origin, and k-means may be able to cluster the projected data (depending on the not-well-defined scaling of the axes); see the sketch below. Or the textbook did not realize that kernel k-means was proposed long ago. In that case the intended argument is probably this: the mean of each shell is $0$, so the shells cannot be distinguished. That clearly holds for unmodified k-means.
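Here is a sketch of that "transform, then cluster" idea (my own illustration; shell parameters and k-means settings are assumptions). Plain k-means on the raw coordinates fails because every shell has mean roughly $0$, but after projecting each point to its distance from the origin, the shells become well-separated 1-D blobs that k-means handles easily.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
k, n = 3, 200

# Shells as in the question: uniform direction, radius ~ Normal(i, 0.1)
# for class i = 1..k.
d = rng.normal(size=(k * n, 2))
d /= np.linalg.norm(d, axis=1, keepdims=True)
r = rng.normal(loc=np.repeat(np.arange(1, k + 1), n), scale=0.1)
X = d * r[:, None]

# On the raw coordinates the centroids cannot tell the shells apart.
raw_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# On the projected feature (distance from the origin) the shells separate.
radius = np.linalg.norm(X, axis=1).reshape(-1, 1)
shell_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(radius)

# Each consecutive block of n points came from one shell; after projection
# each block should map to a single k-means cluster.
print([np.unique(shell_labels[i * n:(i + 1) * n]) for i in range(k)])
```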

Licensed under: CC-BY-SA with attribution