Question

This question is related to an earlier one of mine, in which I asked how to replicate a user-defined function. Now I would like to parallelize the operations in order to save time. What I have done so far is:

  1. I have defined a custom function my.fun(), which returns output, a matrix with 1000 rows and 20 columns.

  2. I replicate output, say, 5 times and store the results in a single matrix called final through: final <- do.call(rbind, replicate(5, my.fun(), simplify=FALSE)). Hence, in this example final is a 5000-row matrix.

What I would like to do now is to parallelize the 5 (or more) replications of output before binding the results into the final matrix.

How would you do that? What I have (wrongly) done so far is:

    library(snowfall)

    sfInit(parallel = TRUE, cpus = 4, type = "SOCK")

    # previously defined objects manipulated within my.fun
    sfExport(...)

    my.fun = function() {
       ...
       return(output)
    }

    final <- do.call(rbind, sfSapply(1:5, fun=my.fun(), simplify=FALSE))

    sfStop()

but it returns:

Error in get(as.character(FUN), mode = "function", envir = envir) : 
  object 'fun' of mode 'function' was not found

Any help would be greatly appreciated! Please note that I do not necessarily want to use snowfall: the final goal is to parallelize the computation of final efficiently (in reality I have to run a lot of replications).

Solution

sfSapply expects fun to be a function, but you are passing it the result of a single call to my.fun. That is, you want to pass my.fun, not my.fun().
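A minimal sketch of that fix, under some assumptions: the body of my.fun below is a hypothetical stand-in (the real one is elided in the question), the function takes one unused argument because each element of 1:5 is passed to it, and sfLapply is used in place of sfSapply(..., simplify=FALSE) since a list is what do.call(rbind, ...) needs.

```r
library(snowfall)

# Hypothetical stand-in for the asker's my.fun: returns a 1000 x 20 matrix.
# The argument i receives the replication index and is otherwise unused.
my.fun <- function(i) {
  matrix(rnorm(1000 * 20), nrow = 1000, ncol = 20)
}

sfInit(parallel = TRUE, cpus = 4, type = "SOCK")

# Pass the function object itself (my.fun), not the result of a call (my.fun()).
res <- sfLapply(1:5, my.fun)

sfStop()

# Stack the five 1000 x 20 matrices into one 5000 x 20 matrix.
final <- do.call(rbind, res)
dim(final)
```

If the real my.fun relies on objects from the master session, they would still need to be exported to the workers with sfExport, as in the question.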

OTHER TIPS

I don't have any experience with parallel computing in R. I had to add a dummy argument to the function my.fun, otherwise sfSapply complains with this error:

 first error: unused argument(s) (X[[1]])

So I added x as an argument:

  my.fun <- function(x) matrix(1:4, 2,2)
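As an alternative to editing my.fun, the index can be absorbed by a small anonymous wrapper. The sketch below assumes the base parallel package instead of snowfall (the question says snowfall is not a requirement), uses a toy zero-argument my.fun, and picks 2 workers arbitrarily.

```r
library(parallel)

# Hypothetical zero-argument function, standing in for the asker's my.fun().
my.fun <- function() matrix(1:4, 2, 2)

cl <- makeCluster(2)  # two local worker processes (arbitrary choice)

# Make my.fun visible on each worker; envir points at where my.fun is defined.
clusterExport(cl, "my.fun", envir = environment())

# The wrapper takes the index i and ignores it, so my.fun needs no argument.
res <- parLapply(cl, 1:5, function(i) my.fun())

stopCluster(cl)

# Five 2 x 2 matrices stacked row-wise give a 10 x 2 matrix.
final <- do.call(rbind, res)
dim(final)
```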

Then I benchmarked the parallel solution against the plain sapply one:

  library(snowfall)
  library(rbenchmark)

  sfInit(parallel = TRUE, cpus = 4)
  # compare parallel sfSapply (pp) against serial sapply (nopp)
  benchmark(
    pp   = sfSapply(1:20000, fun = my.fun, simplify = FALSE),
    nopp = sapply(1:20000, FUN = my.fun, simplify = FALSE))
  sfStop()

The parallel solution is slower than the classic one! I am really confused. Maybe others more experienced with parallel computing in R can give us a logical explanation.

  test replications elapsed relative user.self sys.self user.child sys.child
2 nopp          100   15.22    1.000     13.90     0.02         NA        NA
1   pp          100   27.28    1.792     11.95     2.04         NA        NA
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow