Question

I try to compare up to thousands of estimated beta distributions. Each beta distribution is characterized by the two shape parameters alpha & beta. I now draw 100,000 samples of every distribution. As a final result I want to get an order of the distributions with the highest Probability in every sample draw. My first approach was to use lapply for generating a matrix of N * NDRAWS numeric values which was consuming too much memory as N gets beyond 10,000. (10,000 * 100,000 * 8 Bytes)

So I decided to use a sequential approach of ordering every single draw, then cumsum the order of all draws and get the final order as shown in the example below:

set.seed(12345)
N=100
NDRAWS=100000
df <- data.frame(alpha=sample(1:20, N, replace=T), beta=sample(1:200, N, replace=T))  

vec    <- vector(mode = "integer", length = N )

for(i in 1:NDRAWS){
  # order probabilities after a single draw for every theta
  pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta) )

  # sum up winning positions for every theta
  vec[pos] <- vec[pos] + 1:N
}

# order thetas
ord <- order(-vec)

df[ord,]

This is only consuming N * 4 Bytes of memory, as there is no giant matrix but a single vector of length N. My Question now is, how to speed up this operation using snowfall (or any other multicore package) by taking advantage of my 4 CPU Cores, instead of using just one core???

# parallelize using snowfall pckg
library(snowfall)
sfInit( parallel=TRUE, cpus=4, type="SOCK")
sfLapply( 1:NDRAWS, function(x) ?????? )
sfStop()

Any help is appreciated!

Was it helpful?

Solution

This can be parallelized in the same way that one would parallelize random forest or bootstrapping. You just perform the sequential code on each of the workers but with each using a smaller number of iterations. That is much more efficient than splitting each iteration of the for loop into a separate parallel task.

Here's your complete example converted to use the foreach package with the doParallel backend:

set.seed(12345)
N=100
NDRAWS=100000
df <- data.frame(alpha=sample(1:20, N, replace=T),
                 beta=sample(1:200, N, replace=T))
library(doParallel)
nworkers <- detectCores()
cl <- makePSOCKcluster(nworkers)
clusterSetRNGStream(cl, c(1,2,3,4,5,6,7))
registerDoParallel(cl)

vec <- foreach(ndraws=rep(ceiling(NDRAWS/nworkers), nworkers),
               .combine='+') %dopar% {
  v <- integer(N)
  for(i in 1:ndraws) {
    pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta) )
    v[pos] <- v[pos] + 1:N
  }
  v
}
ord <- order(-vec)
df[ord,]

Note that this gives different results than the sequential version because different random numbers are generated by the workers. I used the parallel random number support provided by the parallel package since that is good practice.

OTHER TIPS

Well, the functionality is there. I'm not sure though what you'd be returning with each iteration.

Perhaps try this?

myFunc <- function(xx, N) {
  pos <- order(rbeta(N, shape1=df$alpha, shape2=df$beta) )
  vec[pos] + 1:N
}

Using doParallel will allow you to add results:

require(doParallel)
registerDoParallel(cores=4)
foreach(i=1:NDRAWS, .combine='+') %dopar% myFunc(i, N)
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top