Question

I'm using the Statistics Toolbox function kmeans in MATLAB for the first time. I want to use the total Euclidean distance to the nearest centroid as an indicator of the optimal k. Here is my code:

clear all

N=10;

opts=statset('MaxIter',1000);

X=dlmread(['data.txt']);

crit=zeros(1,N);
for j=1:N
    [a,b,c]=kmeans(X,j,'Start','cluster','EmptyAction','drop','Options',opts);
    clear a b
    crit(j)=sum(c);
end

save(['crit_',VF,'_',num2str(i),'_limswvl1.mat'],'crit')

Everything should work, but I get this error at j = 6:
X must have more rows than the number of clusters.

I do not understand the problem since X has 54 rows, and no NaNs.
I tried using different EmptyAction options but it still won't work.

Any ideas? :)


Solution

The problem occurs because you use the 'cluster' start method to get the initial centroids. From the MATLAB documentation:

'cluster' - Perform preliminary clustering phase on random 10% subsample of X. This preliminary phase is itself initialized using 'sample'.

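So when j=6, kmeans tries to divide a 10% subsample of the data into 6 clusters. With 54 rows, that subsample has only about 5 rows (10% of 54 is 5.4), which is fewer than the 6 clusters requested. That is why you get the error "X must have more rows than the number of clusters."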
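The arithmetic can be checked in a few lines. This is only a sketch: it assumes the subsample size is 10% of the row count rounded down, which is not confirmed above and may differ between MATLAB releases. The variable names are illustrative.

```matlab
n = 54;                  % number of rows in X (from the question)
k = 6;                   % cluster count at which the error appears
nSub = floor(0.1 * n);   % assumed size of the 10% subsample

% Compare the subsample size with the number of clusters requested
fprintf('subsample rows: %d, clusters requested: %d\n', nSub, k);
if nSub < k
    disp('The subsample has fewer rows than clusters, so the preliminary clustering fails.');
end
```

The same reasoning suggests that 'cluster' is only usable when the data set has roughly ten times as many rows as the largest k you intend to try.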

To get around this problem, use a different 'Start' option: either 'sample', which picks k observations from X at random, or 'uniform', which picks k points uniformly at random from the range of X.
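As a minimal sketch, the loop from the question could be adapted as follows, changing only the 'Start' option. The NaN guard is an assumption added here in case clusters dropped by 'EmptyAction','drop' leave NaN entries in the third output. It is harmless if they do not.

```matlab
N = 10;
opts = statset('MaxIter', 1000);
X = dlmread('data.txt');          % same input file as in the question

crit = zeros(1, N);
for j = 1:N
    % 'sample' takes j rows of X at random as the initial centroids,
    % so no 10% subsample is involved.
    [~, ~, sumd] = kmeans(X, j, 'Start', 'sample', ...
        'EmptyAction', 'drop', 'Options', opts);

    % Assumption: ignore NaN entries that dropped clusters may leave.
    crit(j) = sum(sumd(~isnan(sumd)));
end
```

Note that with the default distance measure, the third output holds sums of squared Euclidean distances, so crit is a total squared distance rather than a plain Euclidean one.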

Licensed under: CC-BY-SA with attribution