Question

I'm trying to use the snow package to score an elastic net model in R, but I can't figure out how to get the predict function to run across multiple nodes in the cluster. The code below contains both a timing benchmark and the actual code producing the error:

##############
#Snow example#
##############

library(snow)
library(glmnet)
library(mlbench)

data(BostonHousing)
BostonHousing$chas<-as.numeric(BostonHousing$chas)

ind<-as.matrix(BostonHousing[,1:13],col.names=TRUE)
dep<-as.matrix(BostonHousing[,14],col.names=TRUE)

fit_lambda<-cv.glmnet(ind,dep)

#fit elastic net
fit_en<<-glmnet(ind,dep,family="gaussian",alpha=0.5,lambda=fit_lambda$lambda.min)

ind_exp<-rbind(ind,ind)

#single thread baseline
i<-0
while(i < 2000){
    ind_exp<-rbind(ind_exp,ind)
    i = i+1
    }

system.time(st<-predict(fit_en,ind_exp))

#formula for parallel execution
pred_en<-function(x){
    x<-as.matrix(x)
    return(predict(fit_en,x))
    }

#make the cluster
cl<-makeSOCKcluster(4)
clusterExport(cl,"fit_en")
clusterExport(cl,"pred_en")

#parallel baseline
system.time(mt<-parRapply(cl,ind_exp,pred_en))

I have been able to parallelize via forking on a Linux box using multicore, but I ended up having to use a poorly performing mclapply combined with unlist. I'm looking for a better way to do it with snow, which would also work on both my dev Windows PC and my prod Linux servers. Thanks SO.


Solution

I should start by saying that the predict.glmnet function doesn't seem to be compute-intensive enough to be worth parallelizing. But this is an interesting example, and my answer may be helpful to you even if this particular case isn't worth parallelizing.

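The main problem is that parRapply is a parallel wrapper around apply: it splits the matrix into submatrices, but then calls your function on each individual row of those submatrices, which isn't what you want. Each row arrives as a plain vector, so as.matrix(x) turns it into a one-column matrix rather than a one-row matrix with 13 columns, which is presumably why predict fails. You want your function to be called directly on the submatrices. Snow doesn't contain a convenience function that does that, but it's easy to write one: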

rowchunkapply <- function(cl, x, fun, ...) {
    do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}
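
To see why the per-row call goes wrong, here is a small base R sketch on a toy matrix (the matrix and the variable names are made up for illustration, they are not from your example). It compares the shape your function would see under apply with the shape of a row chunk passed directly:

```r
m <- matrix(1:6, nrow = 2)   # toy matrix: 2 rows, 3 columns

# apply() hands each row to the function as a plain vector of length 3;
# as.matrix() then turns that vector into a 3 x 1 column matrix
row_dims <- apply(m, 1, function(r) dim(as.matrix(r)))

# passing a block of rows directly keeps the original column layout
chunk_dims <- dim(m[1:2, , drop = FALSE])
```

Each column of row_dims holds the dimensions seen for one row, while chunk_dims keeps the three columns a predict call would expect.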

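Another problem in your example is that glmnet needs to be loaded on the workers so that predict dispatches to the glmnet method. You don't need to export pred_en explicitly, though: clusterApply sends the function it is given to the workers along with each task. fit_en still has to be exported, because pred_en looks it up in the worker's global environment.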

Here's my version of your example:

library(snow)
library(glmnet)
library(mlbench)

data(BostonHousing)
BostonHousing$chas <- as.numeric(BostonHousing$chas)
ind <- as.matrix(BostonHousing[,1:13], col.names=TRUE)
dep <- as.matrix(BostonHousing[,14], col.names=TRUE)
fit_lambda <- cv.glmnet(ind, dep)
fit_en <- glmnet(ind, dep, family="gaussian", alpha=0.5,
                 lambda=fit_lambda$lambda.min)
ind_exp <- do.call("rbind", rep(list(ind), 2002))

# make and initialize the cluster
cl <- makeSOCKcluster(4)
clusterEvalQ(cl, library(glmnet))
clusterExport(cl, "fit_en")

# execute a function on row chunks of x and rbind the results
rowchunkapply <- function(cl, x, fun, ...) {
    do.call('rbind', clusterApply(cl, splitRows(x, length(cl)), fun, ...))
}

# worker function
pred_en <- function(x) {
    predict(fit_en, x)
}
mt <- rowchunkapply(cl, ind_exp, pred_en)

# shut down the workers when finished
stopCluster(cl)
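
If you want to convince yourself that chunking by rows gives the same answer as a single serial call, the following is a minimal sketch of that check. It makes several assumptions: it uses the parallel package bundled with R (whose cluster functions derive from snow) instead of snow itself, splitIndices in place of snow's splitRows, and a plain matrix product with made-up coefficients as a stand-in for predict on the fitted model. The names rowchunkapply2, score and beta are hypothetical.

```r
library(parallel)  # bundled with R; used here as a stand-in for snow

# hypothetical helper: run fun on contiguous row chunks and rbind the results
rowchunkapply2 <- function(cl, x, fun, ...) {
    idx <- splitIndices(nrow(x), length(cl))
    chunks <- lapply(idx, function(i) x[i, , drop = FALSE])
    do.call("rbind", clusterApply(cl, chunks, fun, ...))
}

cl <- makePSOCKcluster(2)

x <- matrix(seq_len(40), nrow = 10)   # toy data: 10 rows, 4 columns
beta <- c(1, -1, 0.5, 2)              # made-up coefficients, not a fitted model

# worker function: the coefficients are passed as an argument,
# so nothing has to be exported to the workers
score <- function(chunk, beta) chunk %*% beta

par_res <- rowchunkapply2(cl, x, score, beta = beta)
ser_res <- x %*% beta

stopCluster(cl)

# TRUE when the chunked result matches the serial one
same_result <- isTRUE(all.equal(par_res, ser_res))
```

Passing the coefficients through the ... argument is just one design choice; exporting the object with clusterExport, as in the glmnet example, works equally well when the object is large and reused across many calls.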

You may also be interested in the parallel option of cv.glmnet (parallel = TRUE), which uses the foreach package and requires a parallel backend to be registered first.
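
For what it's worth, here is a minimal sketch of that option. The choice of doParallel as the foreach backend, the two-worker cluster, and the synthetic data are all my assumptions; any registered foreach backend should work in its place.

```r
library(glmnet)
library(doParallel)  # assumption: one of several possible foreach backends

set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200)   # synthetic predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(200)      # synthetic response

# start two workers and register them as the foreach backend
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# the cross-validation folds are fitted through foreach on the workers
cv_fit <- cv.glmnet(x, y, alpha = 0.5, parallel = TRUE)

parallel::stopCluster(cl)

best_lambda <- cv_fit$lambda.min
```

Note that this parallelizes the cross-validation fits, not the predict call, so it helps with model selection rather than with scoring.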

Licensed under: CC-BY-SA with attribution