How to create a more efficient simulation loop for Monte Carlo in R

https://stackoverflow.com/questions/9009143

14-11-2019
|

Question

The purpose of this exercise is to create a population distribution of nutrient intake values. There were repeated measures in the earlier data, these have been removed so each row is a unique person in the data frame.

I have this code, which works quite well when tested with a small number of my data frame rows. For all 7135 rows, it is very slow. I tried to time it, but I crashed it out when the elapsed running time on my machine was 15 hours. The system.time results were Timing stopped at: 55625.08 2985.39 58673.87.

I would appreciate any comments on speeding up the simulation:

Male.MC <-c()
for (j in 1:100)            {
for (i in 1:nrow(Male.Distrib))  {
    u2        <- Male.Distrib$stddev_u2[i] * rnorm(1, mean = 0, sd = 1)
    mc_bca    <- Male.Distrib$FixedEff[i] + u2
    temp      <- Lambda.Value*mc_bca+1
    ginv_a    <- temp^(1/Lambda.Value)
    d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
    mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
z <- data.frame(
     RespondentID = Male.Distrib$RespondentID[i], 
     Subgroup     = Male.Distrib$Subgroup[i], 
     mc_amount    = mc_amount,
     IndvWeight   = Male.Distrib$INDWTS[i]/100
     )

Male.MC <- as.data.frame(rbind(Male.MC,z))
    }
}

For each of the 7135 observations in my dataset, 100 simulated nutrient values are created, then back transformed to the original measurement level (the simulation is using the results from a nonlinear mixed effect model on BoxCox transformed nutrient values).

I would prefer not to use for loops, as I read that they are inefficient in R but I do not understand enough about options based on apply to use those as an alternative. R is being run on stand-alone machines, normally this would be a standard Dell-type desktop running a Windows 7 variant, if that influences the recommendations for how to change the code.

Update: To reproduce this for testing, Lambda.Value=0.4 and Male.Resid.Var=12.1029420429778 and Male.Distrib$stddev_u2 is a constant value over all observations.

str(Male.Distrib) is

'data.frame':   7135 obs. of  14 variables:
 $ RndmEff     : num  1.34 -5.86 -3.65 2.7 3.53 ...
 $ RespondentID: num  9966 9967 9970 9972 9974 ...
 $ Subgroup    : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
 $ RespondentID: int  9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
 $ Replicates  : num  41067 2322 17434 21723 375 ...
 $ IntakeAmt   : num  33.45 2.53 9.58 43.34 55.66 ...
 $ RACE        : int  2 3 2 2 3 2 2 2 2 1 ...
 $ INDWTS      : num  41067 2322 17434 21723 375 ...
 $ TOTWTS      : num  1.21e+08 1.21e+08 1.21e+08 1.21e+08 1.21e+08 ...
 $ GRPWTS      : num  41657878 22715139 10520535 41657878 10791729 ...
 $ NUMSUBJECTS : int  1466 1100 1424 1466 1061 1466 1424 1252 1061 1424 ...
 $ TOTSUBJECTS : int  7135 7135 7135 7135 7135 7135 7135 7135 7135 7135 ...
 $ FixedEff    : num  6.09 6.76 7.08 6.09 6.18 ...
 $ stddev_u2   : num  2.65 2.65 2.65 2.65 2.65 ...

head(Male.Distrib) is

    RndmEff RespondentID Subgroup RespondentID Replicates IntakeAmt RACE INDWTS    TOTWTS   GRPWTS NUMSUBJECTS TOTSUBJECTS  FixedEff stddev_u2
1  1.343753         9966        6         9966      41067 33.449808    2  41067 120622201 41657878        1466        7135  6.089918  2.645938
2 -5.856516         9967        5         9967       2322  2.533528    3   2322 120622201 22715139        1100        7135  6.755664  2.645938
3 -3.648339         9970        4         9970      17434  9.575439    2  17434 120622201 10520535        1424        7135  7.079757  2.645938
4  2.697533         9972        6         9972      21723 43.340180    2  21723 120622201 41657878        1466        7135  6.089918  2.645938
5  3.531878         9974        3         9974        375 55.660607    3    375 120622201 10791729        1061        7135  6.176319  2.645938
6  6.627767         9976        6         9976      48889 91.480049    2  48889 120622201 41657878        1466        7135  6.089918  2.645938

Update 2: the line of the function that is causing the NaN results is

d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))

Thanks to everyone for their assistance and comments, and also for the speed of responses.

Update: @Ben Bolker is correct that it is the negative temp values that are causing the NaN issue. I missed this with some testing (after commenting out the function so that only the temp values are returned, and calling my result data frame Test). This code reproduces the NaN issue:

> min(Test)
[1] -2.103819
> min(Test)^(1/Lambda.Value)
[1] NaN

But putting the value in as a value and then running the same(?) calculation gives me a result, so I missed this when doing manual calculations:

> -2.103819^(1/Lambda.Value) 
[1] -6.419792

I now have working code that (I think) uses vectorization, and it is blindingly fast. Just in case anyone else has this issue, I am posting the working code below. I've had to add a minimum to prevent the <0 issue with the calculation. Thank you to everyone who helped, and to coffee. I did try putting the rnorm results to a dataframe, and that really slowed things down, creating them this way and then using cbind is really quick. Male.Distrib is my full data frame of 7135 observations, but this code should work on the cutdown version I posted earlier (not tested).

Min_bca <- ((.5*min(Male.AddSugar$IntakeAmt))^Lambda.Value-1)/Lambda.Value
Test <- Male.Distrib[rep(seq.int(1,nrow(Male.Distrib)), 100), 1:ncol(Male.Distrib)]
RnormOutput <- rnorm(nrow(Test),0,1)
Male.Final <- cbind(Test,RnormOutput)
Male.Final$mc_bca    <- Male.Final$FixedEff + (Male.Final$stddev_u2 *     Male.Final$RnormOutput)
Male.Final$temp      <- ifelse(Lambda.Value*Male.Final$mc_bca+1 > Lambda.Value*Min_bca+1,
                           Lambda.Value*Male.Final$mc_bca+1, Lambda.Value*Min_bca+1)
Male.Final$ginv_a    <- Male.Final$temp^(1/Lambda.Value)
Male.Final$d2ginv_a  <- ifelse(0 > (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2),
                           0, (1-Lambda.Value)*Male.Final$temp^(1/Lambda.Value-2))
Male.Final$mc_amount <- Male.Final$ginv_a + Male.Final$d2ginv_a * Male.Resid.Var / 2

Lessons for the day:

a distribution function does not appear to be resampled in a loop if you try to do what I was trying earlier
you can't use max() the way I tried, as it returns the maximum value from the column, whereas I wanted the maximum from two values. The ifelse statement is the replacement one to do.

Solution

Here is an approach that addresses the 2 biggest speed issues:

Instead of looping over observations(i), we compute them all at once.
Instead of looping over MC replications (j), we use replicate, which is a simplified apply meant for this purpose.

First we load the dataset and define a function for what you were doing.

Male.Distrib = read.table('MaleDistrib.txt', check.names=F)

getMC <- function(df, Lambda.Value=0.4, Male.Resid.Var=12.1029420429778) {
  u2        <- df$stddev_u2 * rnorm(nrow(df), mean = 0, sd = 1)
  mc_bca    <- df$FixedEff + u2
  temp      <- Lambda.Value*mc_bca+1
  ginv_a    <- temp^(1/Lambda.Value)
  d2ginv_a  <- max(0,(1-Lambda.Value)*temp^(1/Lambda.Value-2))
  mc_amount <- ginv_a + d2ginv_a * Male.Resid.Var / 2
  mc_amount
}

Then we replicate it a bunch of times.

> replicate(10, getMC(Male.Distrib))
         [,1]      [,2]     [,3]     [,4]      [,5]     [,6]     [,7]     [,8]     [,9]    [,10]
[1,] 36.72374 44.491777 55.19637 23.53442 23.260609 49.56022 31.90657 25.26383 25.31197 20.58857
[2,] 29.56115 18.593496 57.84550 22.01581 22.906528 22.15470 29.38923 51.38825 13.45865 21.47531
[3,] 61.27075 10.140378 75.64172 28.10286  9.652907 49.25729 23.82104 31.77349 16.24840 78.02267
[4,] 49.42798 22.326136 33.87446 14.00084 25.107143 25.75241 30.20490 33.14770 62.86563 27.33652
[5,] 53.45546  9.673162 22.66676 38.76392 30.786100 23.42267 28.40211 35.95015 43.75506 58.83676
[6,] 34.72440 23.786004 63.57919  8.08238 12.636745 34.11844 14.88339 21.93766 44.53451 51.12331

Then you can reformat, add IDs, etc., but this is the idea for the main computational part. Good luck!

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow