Question

I am using the kmeans() function in R and I was curious what is the difference between the totss and tot.withinss attributes of the returned object. From the documentation they seem to be returning the same thing, but applied on my dataset the value of totss is 66213.63 and for tot.withinss is 6893.50. Please let me know if you are familiar with mroe details. Thank you!

Marius.

Was it helpful?

Solution

Given the between sum of squares betweenss and the vector of within sum of squares for each cluster withinss the formulas are these:

totss = tot.withinss + betweenss
tot.withinss = sum(withinss)

For example, if there were only one cluster then betweenss would be 0, there would be only one component in withinss and totss = tot.withinss = withinss.

For further clarification, we can compute these various quantities ourselves given the cluster assignments and that may help clarify their meanings. Consider the data x and the cluster assignments cl$cluster from the example in help(kmeans). Define the sum of squares function as follows -- this subtracts the mean of each column of x from that column and then sums of the squares of each element of the remaining matrix:

# or ss <- function(x) sum(apply(x, 2, function(x) x - mean(x))^2)
ss <- function(x) sum(scale(x, scale = FALSE)^2)

Then we have the following. Note that cl$centers[cl$cluster, ] are the fitted values, i.e. it iis a matrix with one row per point such that the ith row is the center of the cluster that the ith point belongs to.

example(kmeans) # create x and cl

betweenss <- ss(cl$centers[cl$cluster,]) # or ss(fitted(cl))

withinss <- sapply(split(as.data.frame(x), cl$cluster), ss)
tot.withinss <- sum(withinss) # or  resid <- x - fitted(cl); ss(resid)

totss <- ss(x) # or tot.withinss + betweenss

cat("totss:", totss, "tot.withinss:", tot.withinss, 
  "betweenss:", betweenss, "\n")

# compare above to:

str(cl)

EDIT:

Since this question was answered, R has added additional similar kmeans examples (example(kmeans)) and a new fitted.kmeans method and we now show how the fitted method fits into the above in the comments trailing the code lines.

OTHER TIPS

I think you have spotted an error in the documentation ... which says:

withinss     The within-cluster sum of squares for each cluster.
totss        The total within-cluster sum of squares.
tot.withinss     Total within-cluster sum of squares, i.e., sum(withinss).

If you use the sample dataset in the help page example:

> kmeans(x,2)$tot.withinss
[1] 15.49669
> kmeans(x,2)$totss
[1] 65.92628
> kmeans(x,2)$withinss
[1] 7.450607 8.046079

I think someone should write a request to the r-devel mailing list asking that the help page be revised. I'm willing to do so if you don't want to.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top