Question

I'm trying to use clusplot to visualize kmeans clustering. Reinventthewheel.csv is a symmetric similarity matrix (1087 rows) with values in [0,1]. For some reason, clusplot will only generate a plot of n clusters for certain values of n. For other values of n, it returns the error below:

library(cluster)
simmy = read.csv("reinventthewheel.csv", header=TRUE, row.names=1)
disty = dist(1-simmy)
kay19 <- kmeans(disty,19)$cluster
par(mfrow=c(3,2))
clusplot(disty, diss=TRUE, kay19, color=FALSE, shade=FALSE, lines=0, col.p=kay19, main="KMEANS", sub="19 assortments")
    #(successfully plotted for n=19)

kay20 <- kmeans(disty,20)$cluster
clusplot(disty, diss=TRUE, kay20, color=FALSE, shade=FALSE, lines=0, col.p=kay20, main="KMEANS", sub="20 assortments")
Error in seq.default(-sqrt(yl2), sqrt(yl2), length = n.half) : 
    'from' cannot be NA, NaN or infinite
    #(failed to plot for n=20)

kay21<-kmeans(disty,21)$cluster
clusplot(disty, diss=TRUE, kay21, color=FALSE, shade=FALSE, lines=0, col.p=kay21, main="KMEANS", sub="21 assortments")
Error in seq.default(-sqrt(yl2), sqrt(yl2), length = n.half) : 
    'from' cannot be NA, NaN or infinite
    #(failed to plot for n=21)

kay22<-kmeans(disty,22)$cluster
clusplot(disty, diss=TRUE, kay22, color=FALSE, shade=FALSE, lines=0, col.p=kay22, main="KMEANS", sub="22 assortments")
    #(successfully plotted for n=22)

I thought perhaps n=20 and n=21 were generating empty clusters, but that is not the case. Each cluster has at least one point.

I don't get these errors when plotting hierarchical clusters (using cutree(hclust)) from the same matrix for any n. Any ideas about what could be causing this error? Thanks in advance.

Solution

There are several things going on here.

First, be aware that kmeans(dist,n) starts from n randomly chosen cluster centroids and moves them around until its minimization criterion is met. This often ends in a local minimum, so if you run kmeans(dist,n) repeatedly with the same dist and n, you are likely to get different clusters each time. The problem is worse with a large number of clusters, or with clusters that are poorly differentiated, and both conditions apply in your case.
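To see this run-to-run variation for yourself, here is a minimal sketch. It uses synthetic data as a stand-in (an assumption for illustration, not your similarity matrix) and compares the total within-cluster sum of squares over five runs with the same data and the same number of clusters.

```r
# Synthetic stand-in data: 300 points with no strong cluster structure.
# This is an illustrative assumption, not the asker's matrix.
set.seed(42)
x <- matrix(rnorm(600), ncol = 2)

# Same data, same k, five separate runs. Each call draws new starting
# centers, so the runs can settle in different local minima.
wss <- sapply(1:5, function(i) {
  kmeans(x, centers = 15, iter.max = 50)$tot.withinss
})

print(wss)
```

If the printed values differ, the runs converged to different solutions even though nothing about the input changed.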

You can make the process repeatable by calling set.seed(x) before running kmeans(...), but this still does not guarantee the "best" arrangement of n clusters. When I run your code with your data, clusplot(...) works fine for 19, 20, and 22 clusters and fails for 21 clusters, because I'm getting different clusters than you were.
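A short sketch of both points, again on synthetic stand-in data (an assumption, not your matrix). Resetting the seed reproduces the same clustering. The nstart argument of kmeans keeps the best of several random starts, which tends to give a lower within-cluster sum of squares but is still not guaranteed to find the global optimum.

```r
# Synthetic stand-in data (illustrative assumption).
set.seed(42)
x <- matrix(rnorm(600), ncol = 2)

# Resetting the seed before each call gives identical cluster assignments.
set.seed(1)
a <- kmeans(x, centers = 15, iter.max = 50)
set.seed(1)
b <- kmeans(x, centers = 15, iter.max = 50)
print(identical(a$cluster, b$cluster))

# nstart = 25 tries 25 random starts and keeps the one with the lowest
# total within-cluster sum of squares.
set.seed(1)
multi <- kmeans(x, centers = 15, iter.max = 50, nstart = 25)
print(c(single = a$tot.withinss, multi = multi$tot.withinss))
```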

Second, the error apparently comes from the algorithm that calculates the ellipse for each cluster, which fails in some cases. The default, clusplot(...,span=TRUE), uses a minimum volume ellipsoid approach, which is supposed to enclose each cluster in the smallest ellipse that contains all the points in the cluster. Evidently this algorithm fails for some arrangements of points. With span=FALSE, the ellipses instead assume that the points in a cluster follow a bivariate normal distribution, and each ellipse is based on the covariance matrix of the points in that cluster. When I run your code with span=FALSE, I get no errors.
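If you would rather keep the default ellipses whenever they work, one option is to fall back to span=FALSE only when the default errors out. The sketch below assumes that approach. safe_clusplot is a hypothetical helper of my own, not part of the cluster package, and the data is a synthetic stand-in.

```r
library(cluster)

# Hypothetical helper (not part of the cluster package): try the default
# span = TRUE first, and redraw with span = FALSE if that call errors.
# Returns a string saying which setting produced the final plot.
safe_clusplot <- function(d, clus, ...) {
  tryCatch({
    clusplot(d, clus, diss = TRUE, span = TRUE, ...)
    "span=TRUE"
  }, error = function(e) {
    message("span = TRUE failed: ", conditionMessage(e))
    clusplot(d, clus, diss = TRUE, span = FALSE, ...)
    "span=FALSE"
  })
}

# Synthetic stand-in data (illustrative assumption).
set.seed(1)
x  <- matrix(runif(400), ncol = 4)
d  <- dist(x)
cl <- kmeans(x, centers = 5)$cluster

used <- safe_clusplot(d, cl, color = FALSE, shade = FALSE, lines = 0,
                      col.p = cl, main = "KMEANS", sub = "5 assortments")
print(used)
```

Note that a grid of plots made this way can mix the two ellipse styles, so check the returned value before comparing panels.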

The latter approach essentially draws confidence bands around the centroid of each cluster (I believe these are 95% confidence bands, but I'm not sure). While this leads to much larger ellipses, and a plot that is not as pretty as the minimum volume approach, IMO this is a much better way to represent the data, because it accurately depicts the fact that there is a lot of overlap in your clusters: many of the points could just as easily belong in multiple clusters. When I use the confidence band approach, I get the plots below. The code at the end is almost identical to yours, but I include it to show that if you run that code you will get the same result.
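To illustrate the idea of a covariance-based ellipse, here is a sketch of a 95% ellipse for one cluster of bivariate normal points. It is not a reproduction of what clusplot does internally. The 0.95 level is my assumption (as noted above, I'm not sure which level clusplot uses), and the points are synthetic.

```r
# Synthetic 2-D points standing in for a single cluster (assumption).
set.seed(1)
pts <- cbind(rnorm(50, mean = 0, sd = 2), rnorm(50, mean = 0, sd = 1))

mu <- colMeans(pts)          # cluster centroid
S  <- cov(pts)               # covariance matrix of the cluster's points
r2 <- qchisq(0.95, df = 2)   # squared Mahalanobis radius; 0.95 is assumed

# Map the unit circle onto the ellipse using the Cholesky factor of S.
theta  <- seq(0, 2 * pi, length.out = 200)
circle <- cbind(cos(theta), sin(theta))
ell    <- sweep(sqrt(r2) * circle %*% chol(S), 2, mu, "+")

# Draw the points and the ellipse on a common scale.
plot(rbind(pts, ell), type = "n", asp = 1, xlab = "x", ylab = "y")
points(pts)
lines(ell)

# Share of the points that fall inside the ellipse.
print(mean(mahalanobis(pts, mu, S) <= r2))
```

Every point on the ellipse has the same Mahalanobis distance from the centroid, which is why elongated or widely scattered clusters produce large, overlapping ellipses.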

library(cluster)
simmy = read.csv("reinventthewheel.csv", header=TRUE, row.names=1)
disty = dist(1-simmy)
set.seed(1)
kay19 <- kmeans(disty,19)$cluster
kay20 <- kmeans(disty,20)$cluster
kay21<-kmeans(disty,21)$cluster
kay22<-kmeans(disty,22)$cluster

par(mfrow=c(2,2))
s=FALSE    # span=FALSE: covariance-based ellipses instead of the minimum volume ellipse
clusplot(disty, diss=TRUE, kay19, color=FALSE, shade=FALSE, lines=0, col.p=kay19, main="KMEANS", sub="19 assortments",span=s)
clusplot(disty, diss=TRUE, kay20, color=FALSE, shade=FALSE, lines=0, col.p=kay20, main="KMEANS", sub="20 assortments",span=s)
clusplot(disty, diss=TRUE, kay21, color=FALSE, shade=FALSE, lines=0, col.p=kay21, main="KMEANS", sub="21 assortments",span=s)
clusplot(disty, diss=TRUE, kay22, color=FALSE, shade=FALSE, lines=0, col.p=kay22, main="KMEANS", sub="22 assortments",span=s)
Licensed under: CC-BY-SA with attribution