Question

Here is an example of how to use the k-means algorithm: http://mnemstudio.org/clustering-k-means-example-1.htm

In this example, the author chose as initial centroids "the two individuals furthest apart (using the Euclidean distance measure)".

What if I want not two clusters, but 10? How do I choose the first 10 centroids? Is there a way to choose the ten individuals furthest apart, or should I choose them some other way?

PS: I don't think a random choice will work well in my case. I have also tried using the first 10 individuals as centroids, but I am looking for a better way to choose them.

Solution

Simply choosing the K entities furthest apart as initial centroids is rather dangerous. Real-world data sets tend to contain outliers, and under your approach these would be chosen as initial centroids.

There are many initialization algorithms for K-Means; you may want to take a look at intelligent K-Means.
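To make the outlier problem concrete, here is a minimal sketch of one possible reading of "furthest apart": a greedy farthest-first selection. The function name `farthest_first` and the sample data are made up for illustration, and this is not necessarily what the linked example does.

```python
import math

def farthest_first(points, k):
    """Greedy selection: start from the first point, then repeatedly add
    the point whose distance to its nearest chosen seed is largest."""
    seeds = [points[0]]
    # min_dist[i] = distance from points[i] to its nearest seed so far
    min_dist = [math.dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        idx = max(range(len(points)), key=lambda i: min_dist[i])
        seeds.append(points[idx])
        # Update each point's distance to its nearest seed
        min_dist = [min(d, math.dist(p, points[idx]))
                    for d, p in zip(min_dist, points)]
    return seeds

# Hypothetical data: two tight groups plus a single outlier at (100, 100)
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (100, 100)]
seeds = farthest_first(data, 2)
print(seeds)
```

With this data the second seed is the outlier, so one of the two clusters starts out centered on a point that belongs to neither group.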

OTHER TIPS

The most common way to choose initial centroids is k-means++ (http://en.wikipedia.org/wiki/K-means%2B%2B), which comes with a theoretical performance guarantee:

http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf

Many Python clustering packages implement this initialization, such as mlpy and SciPy's KMeans, but I don't know of a Java equivalent.
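For anyone who has to implement it by hand, the seeding step of k-means++ is short: pick the first centroid uniformly at random, then pick each further centroid with probability proportional to its squared distance from the nearest centroid already chosen. Below is a minimal sketch using only the Python standard library. The name `kmeans_pp_init` and the `rng` parameter are my own, not taken from any package, and the same logic should port to Java fairly directly.

```python
import math
import random

def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centroids from points using k-means++ style seeding."""
    rng = rng or random.Random()
    # First centroid: uniformly at random
    centroids = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest centroid so far
    d2 = [math.dist(p, centroids[0]) ** 2 for p in points]
    while len(centroids) < k:
        if sum(d2) == 0:
            # Every point coincides with a centroid; fall back to uniform
            nxt = rng.choice(points)
        else:
            # Sample proportionally to squared distance
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centroids.append(nxt)
        d2 = [min(d, math.dist(p, nxt) ** 2) for d, p in zip(d2, points)]
    return centroids

# Hypothetical data, fixed seed only to make the run repeatable
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (100, 100)]
init = kmeans_pp_init(sample, 3, rng=random.Random(0))
print(init)
```

Because the choice is random rather than a hard maximum, a far outlier is likely but not certain to be picked, and points in dense regions still have a chance. The returned centroids are then passed to the usual k-means iterations.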

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow