Assign starting configuration to k-means partitioning in R

https://stackoverflow.com/questions/22213482

r
k-means

10-06-2023

Question

Context: In their numerical ecology textbook, Legendre & Legendre suggest assigning an initial starting configuration (or "seed groups") before doing a K-means partitioning, because the algorithm is so sensitive to initial conditions. The starting configuration can be determined through Ward's clustering or ecological intuition.

Q: How do I give R an initial grouping in the K-means method? Which K-means function can handle initial groupings?

Here is a snippet of my dataset, where the column "seedgroup" defines the factors for the initial groupings. I want to tell R to take Sites C and G as the starting configuration for group 0, Sites A, D, F, H for group 1, and Sites B and E for group 2.

       seedgroup RhodDec VaccVit VaccOxy RubuCam ChamCal
SiteA         1    0.00    0.01    0.01    0.00    0.00
SiteB         2    0.00    0.01    0.00    0.00    0.00
SiteC         0    0.00    0.01    0.01    0.01    0.00
SiteD         1    0.00    0.01    0.00    0.00    0.00
SiteE         2    0.09    0.02    0.01    0.01    0.02
SiteF         1    0.00    0.00    0.01    0.03    0.02
SiteG         0    0.00    0.01    0.06    0.02    0.01
SiteH         1    0.00    0.01    0.00    0.00    0.00

Thanks!

Solution

Here's one way. The kmeans(...) function in base R has a centers argument that accepts a matrix of initial cluster centers instead of a number of clusters. So you could calculate the centers as the per-group means implied by seedgroup. Calling your dataset df:

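# Mean of every column within each seed group (rows come out sorted: 0, 1, 2)
centers <- aggregate(df[,-1],by=list(df$seedgroup),mean)
# Use those means as the initial centers (column 1 of centers is the group label)
km      <- kmeans(df[,2:6],centers=centers[,2:6])
# kmeans() numbers clusters from 1; subtract 1 to match the 0-based seedgroup
df      <- data.frame(cluster=km$cluster-1,df)
df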
#       cluster seedgroup RhodDec VaccVit VaccOxy RubuCam ChamCal
# SiteA       1         1    0.00    0.01    0.01    0.00    0.00
# SiteB       1         2    0.00    0.01    0.00    0.00    0.00
# SiteC       1         0    0.00    0.01    0.01    0.01    0.00
# SiteD       1         1    0.00    0.01    0.00    0.00    0.00
# SiteE       2         2    0.09    0.02    0.01    0.01    0.02
# SiteF       1         1    0.00    0.00    0.01    0.03    0.02
# SiteG       0         0    0.00    0.01    0.06    0.02    0.01
# SiteH       1         1    0.00    0.01    0.00    0.00    0.00

Note that kmeans(...) returns 1-based cluster numbers, whereas yours are 0-based, hence the km$cluster-1 when building the result. In this limited example, SiteB was moved from cluster 2 -> 1 and SiteC was moved from 0 -> 1, which looks reasonable based on the data.
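
If you want to reproduce this without the original file, here is a minimal self-contained sketch. It rebuilds the eight-site sample from the question by hand, seeds kmeans(...) with the per-group means, and cross-tabulates the seed groups against the final clusters so the moved sites are easy to spot. The names species, seed_centers, moved, ward_groups, ward_centers and km_ward are mine, not from the question. The last three lines are an assumption about how the Ward's-clustering seed mentioned in the question could be derived (hclust with method "ward.D2", cut into three groups); treat them as one possible approach rather than the textbook's exact procedure.

```r
# Rebuild the eight-site sample from the question.
df <- data.frame(
  seedgroup = c(1, 2, 0, 1, 2, 1, 0, 1),
  RhodDec   = c(0.00, 0.00, 0.00, 0.00, 0.09, 0.00, 0.00, 0.00),
  VaccVit   = c(0.01, 0.01, 0.01, 0.01, 0.02, 0.00, 0.01, 0.01),
  VaccOxy   = c(0.01, 0.00, 0.01, 0.00, 0.01, 0.01, 0.06, 0.00),
  RubuCam   = c(0.00, 0.00, 0.01, 0.00, 0.01, 0.03, 0.02, 0.00),
  ChamCal   = c(0.00, 0.00, 0.00, 0.00, 0.02, 0.02, 0.01, 0.00),
  row.names = paste0("Site", LETTERS[1:8])
)
species <- df[, -1]  # abundance columns only

# Seed centers: per-group column means, rows ordered by seedgroup (0, 1, 2).
seed_centers <- aggregate(species, by = list(seedgroup = df$seedgroup), FUN = mean)
km <- kmeans(species, centers = seed_centers[, -1])

# Rows are seed groups, columns are final clusters (shifted back to 0-based).
moved <- table(seed = df$seedgroup, cluster = km$cluster - 1)
print(moved)

# Assumed alternative seed: cut a Ward dendrogram into three groups
# and use the group means as initial centers.
ward_groups  <- cutree(hclust(dist(species), method = "ward.D2"), k = 3)
ward_centers <- aggregate(species, by = list(group = ward_groups), FUN = mean)
km_ward      <- kmeans(species, centers = ward_centers[, -1])
```

Off-diagonal counts in moved are sites that left their seed group. Because the centers are supplied explicitly, there is no random initialization involved, so repeated runs should give the same partition.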

Licensed under: CC-BY-SA with attribution
