Question

I am running k-means on a large dataset (636,688 rows x 7 columns) and have therefore turned to parallelization. I would like to iterate over the number of centers. The example below attempts to fit 2 to 5 centers, 2 times each.

# Iris k-means parallelization example
library(parallel)
data(iris)
iris.cluster <- iris[,-5]

cl <- makeCluster(detectCores())
worker <- function(data, nclus, nstarts){
  kmeans(x = data, centers = nclus, nstart = nstarts)
}
myiter <- 2
nstarts <- rep(25, myiter)
nclus <- 2:5
results <- clusterMap(cl, worker, data = iris.cluster, nclus = nclus, nstarts = nstarts)
stopCluster(cl)

The summary already tells me something is amiss:

> summary(results)
             Length Class  Mode
Sepal.Length 9      kmeans list
Sepal.Width  9      kmeans list
Petal.Length 9      kmeans list
Petal.Width  9      kmeans list

results should actually have 8 rows and no names to the left of Length. It appears that each list entry uses only one variable (column) of the data. I am unfortunately not entirely clear on clusterMap and whether it is the right tool in this case. I now know how to iterate over seed and nstart values (thank you, Steve Weston) but need help iterating over the number of clusters, as described above.


Solution

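You're having a problem passing the arguments to the worker function properly. clusterMap iterates over every argument you give it, and since a data frame is a list of columns, each task receives a single column of "iris.cluster" rather than the whole data set. I believe you need a nested loop over "centers" and "nstart", and you should also export "iris.cluster" to the cluster workers since you don't want to iterate over it. Perhaps this is closer to what you want to do: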

library(parallel)
data(iris)
iris.cluster <- iris[,-5]

cl <- makeCluster(detectCores())
# Copy the data to every worker once, so it is not treated as something to iterate over
clusterExport(cl, 'iris.cluster')
worker <- function(centers, nstart) {
  kmeans(iris.cluster, centers=centers, nstart=nstart)
}
myiter <- 2
nstarts <- rep(25, myiter)
nclus <- 2:5
# One row per task: every nstart value paired with every number of centers
g <- expand.grid(nstarts=nstarts, nclus=nclus)
results <- clusterMap(cl, worker, centers=g$nclus, nstart=g$nstarts)
stopCluster(cl)

This uses the "expand.grid" function to generate the arguments for a total of length(nstarts) * length(nclus) tasks, which is 2 * 4 = 8 in this example.
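As a rough sketch of the same idea, the variant below passes the data through clusterMap's "MoreArgs" argument instead of exporting it, and then attaches each fit's total within-cluster sum of squares to the grid so the runs can be compared. The use of 2 workers, the anonymous worker function, and the added "tot.withinss" column are assumptions made for illustration, not part of the original answer.

```r
library(parallel)

iris.cluster <- iris[, -5]

# A data frame is a list of columns, so iterating over it yields one column per task.
length(iris.cluster)   # 4 columns, matching the 4 entries in the question's summary

# One row per task: every nstart value paired with every number of centers.
g <- expand.grid(nstarts = rep(25, 2), nclus = 2:5)
nrow(g)                # 2 * 4 = 8 tasks

cl <- makeCluster(2)   # assumption: 2 workers, to keep the sketch light
results <- clusterMap(
  cl,
  function(centers, nstart, data) kmeans(data, centers = centers, nstart = nstart),
  centers = g$nclus,
  nstart = g$nstarts,
  MoreArgs = list(data = iris.cluster)  # passed whole to every task, not iterated over
)
stopCluster(cl)

# Record a quality measure next to the settings that produced each fit.
g$tot.withinss <- sapply(results, function(fit) fit$tot.withinss)
```

The clusterExport approach in the answer ships the data to each worker once up front, which may matter for a dataset of the size mentioned in the question. "MoreArgs" is shown here mainly because it keeps the worker function free of global variables.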

Licensed under: CC-BY-SA with attribution