Question

When random initialization of centroids is used, different runs of K-means produce different total SSEs, and this strongly affects the quality of the result. What are some effective approaches to this problem? Recent approaches are appreciated.


Solution

An approach that yields more consistent results is K-means++. This approach acknowledges that there is probably a better choice of initial centroid locations than simple random assignment. Specifically, K-means tends to perform better when the centroids are seeded so that they are spread apart rather than clumped together in space.

In short, the method is as follows:

  1. Choose one of your data points at random as an initial centroid.
  2. Compute $D(x)$, the distance from each remaining data point $x$ to your initial centroid.
  3. Choose your next centroid from the remaining data points, with probability proportional to $D(x)^2$.
  4. Repeat until all centroids have been assigned.

Note: $D(x)$ should be updated as more centroids are chosen: for each data point, it is the distance to the nearest centroid selected so far.
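For illustration, here is a minimal NumPy sketch of this seeding procedure (the function name is my own invention; in practice, scikit-learn's KMeans already uses init='k-means++' by default):

import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Pick k initial centroids from the rows of X via k-means++ seeding."""
    rng = np.random.default_rng(seed)
    # Step 1: the first centroid is a uniformly random data point.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2 (with the note applied): D(x)^2, the squared distance from
        # each point to its nearest centroid chosen so far.
        diffs = X[:, None, :] - np.asarray(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        # Step 3: sample the next centroid with probability proportional to D(x)^2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)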

You may also be interested in the paper that proposed the method, "k-means++: The Advantages of Careful Seeding" (Arthur and Vassilvitskii, 2007), which also describes its overall expected performance.

OTHER TIPS

I may be misunderstanding your question, but k-means usually chooses the centroids randomly for you, based on the number of clusters you set (i.e., k). Choosing k itself tends to be a subjective exercise. A good place to start is an Elbow/Scree plot, described here:

http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set#The_Elbow_Method
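A minimal sketch of such an elbow plot for k-means, assuming a feature matrix X is already defined (the range of k here is arbitrary):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Total within-cluster SSE (inertia) for each candidate number of clusters.
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, 'o-')
plt.xlabel('Number of clusters k')
plt.ylabel('Total SSE (inertia)')
plt.show()  # the "elbow" where the curve flattens suggests a reasonable k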

The usual approach to this problem is to re-run your K-means algorithm several times, with different random initializations of the centroids, and to keep the best solution. You can do that by evaluating the results on your training data or by means of cross validation.
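As a sketch of that restart strategy, assuming a feature matrix X and k = 5 (scikit-learn's n_init parameter performs the same loop internally):

import numpy as np
from sklearn.cluster import KMeans

best_sse, best_model = np.inf, None
for seed in range(10):
    # One run per seed, each with a single random initialization (n_init=1).
    km = KMeans(n_clusters=5, init='random', n_init=1, random_state=seed).fit(X)
    if km.inertia_ < best_sse:  # inertia_ is the total within-cluster SSE
        best_sse, best_model = km.inertia_, km

print("Best SSE over 10 random restarts:", best_sse)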

There are many other ways to initialize the centroids, but none of them is going to perform the best for every single problem. You could evaluate these approaches together with random initialization for your particular problem.
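For example, a quick comparison of seeding strategies on your own data might look like this (again assuming a feature matrix X and k = 5):

from sklearn.cluster import KMeans

# Best-of-10 total SSE for each initializer; lower is better.
for init in ('random', 'k-means++'):
    km = KMeans(n_clusters=5, init=init, n_init=10, random_state=0).fit(X)
    print(init, '->', km.inertia_)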

I agree with the Elbow/Scree plot suggestion; I find it more intuitively sensible than relying on a random seed. Here is example code in the same spirit: it sweeps over values of k and plots the resulting score. (Note that this particular snippet tunes k for a k-nearest-neighbors classifier rather than for k-means, but the elbow idea is the same.)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train, X_test, y_test are assumed to be defined already.
Ks = 30
mean_acc = np.zeros(Ks - 1)
std_acc = np.zeros(Ks - 1)

for n in range(1, Ks):
    # Train a model for each candidate k and score it on the test set.
    kNN_model = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = kNN_model.predict(X_test)
    mean_acc[n - 1] = np.mean(yhat == y_test)
    std_acc[n - 1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])

plt.plot(range(1, Ks), mean_acc, 'g')
plt.fill_between(range(1, Ks), mean_acc - std_acc, mean_acc + std_acc, alpha=0.10)
plt.legend(('Accuracy', '+/- 1 std'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

print("The best accuracy was", mean_acc.max(), "with k =", mean_acc.argmax() + 1)