What to do with stale centroids in K-means

https://datascience.stackexchange.com/questions/9898

16-10-2019
|

Question

When I run Kmeans on my dataset, I notice that some centroids become stale in the they are no longer the closest centroid to any point after some iteration. Right now I am skipping these stale centroids in my next iteration because I think those centroids no longer represent any useful set of the data, however I wanted to know if there are other reasonable ways to deal with these centroids.

Solution

k-means finds only a local optima. Thus a wrong number of cluster or simply some random state of equilibrium in the attracting forces could lead to empty clusters. Technically k-means does not provide a procedure for that, but you can enrich the algorithm with no problem.

There are two approaches which I found that are useful:

remove the stale cluster, choose a random instance from your data set and create a new cluster with centroid equal with the chosen random point
remove the stale cluster, choose the farthest distant point from any other centroids, create a new cluster with centroid in that point

Both procedures can lead to indefinite running time, but if the number of this kind of adjustments is finite (and usually it is) that it will converge with no problem. To guard yourself from infinite running time you can set an upper bound for the number of adjustments.

The procedure itself is not practical if you have a huge data set a a large number of clusters. The running time can became prohibitive.

Another procedure to decrease the chances for that to happen is to use a better initialization procedure, like k-means++. In fact the second suggestion is an idea from k-means++. There are no guarantees, however.

Finally a note regarding implementation. If you can't change to code of the algorithm to make those improvements on the fly, your only option which comes to my mind is to start a new clustering procedure where you initialize the centroid positions for non-stale clusters, and follow procedures for stale clusters.

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange