Question

I'm trying to generate a data frame of simulated values based on existing distribution parameters. My main data frame contains the mean and standard deviation for each observation, like so:

example.data <- data.frame(country=c("a", "b", "c"), 
                           score_mean=c(0.5, 0.4, 0.6), 
                           score_sd=c(0.1, 0.1, 0.2))

#   country score_mean score_sd
# 1       a        0.5      0.1
# 2       b        0.4      0.1
# 3       c        0.6      0.2

I can use sapply() and a custom function to use the score_mean and score_sd parameters to randomly draw from a normal distribution:

score.simulate <- function(score.mean, score.sd) {
  return(mean(rnorm(100, mean=score.mean, sd=score.sd)))
}

simulated.scores <- sapply(example.data$score_mean, 
                       FUN=score.simulate, 
                       score.sd=example.data$score_sd)

# [1] 0.4936432 0.3753853 0.6267956

This will generate one round (or column) of simulated values. However, I'd like to generate a lot of columns (like 100 or 1,000). The only way I've found to do this is to wrap my sapply() call inside an anonymous function passed to lapply() and then convert the resulting list into a data frame with ldply() in plyr:

results.list <- lapply(1:5, FUN=function(x) sapply(example.data$score_mean, FUN=score.simulate, score.sd=example.data$score_sd))

library(plyr)
simulated.scores <- as.data.frame(t(ldply(results.list)))

#           V1        V2        V3        V4        V5
# V1 0.5047807 0.4902808 0.4857900 0.5008957 0.4993375
# V2 0.3996402 0.4128029 0.3875678 0.4044486 0.3982045
# V3 0.6017469 0.6055446 0.6058766 0.5894703 0.5960403

This works, but (1) it seems really convoluted, especially with the as.data.frame(t(ldply(lapply(... FUN=function(x) sapply ...)))) approach, and (2) it is really slow with large numbers of iterations or bigger data: my actual dataset has 3,000 rows, and running 1,000 iterations takes 1–2 minutes.

Is there a more efficient way to create a data frame of simulated values like this?

Solution

The quickest way I can think of is to take advantage of the vectorisation built into rnorm. Both the mean and sd arguments are vectorised, but you can only supply a single integer for the number of draws. If you supply vectors for mean and sd, R recycles them until it has completed the required number of draws. So make the n argument to rnorm a multiple of the length of your mean vector; the multiplier is the number of replicates for each row of your data.frame. In the function below, that multiplier is the argument n, and rnorm is asked for n*length(mean) draws.
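A minimal sketch of that recycling, using sd = 0 so that every draw collapses to its mean and the order becomes visible. The variable names means, draws and recycled are made up for this illustration:

```r
# With sd = 0 each "draw" equals its mean, which exposes the recycling order
means <- c(1, 10, 100)
draws <- rnorm(2 * length(means), mean = means, sd = 0)
draws
# [1]   1  10 100   1  10 100

# Filling a matrix column by column puts each mean on its own row,
# with one column per replicate
recycled <- matrix(draws, nrow = length(means))
```

Because matrix() fills column by column by default, row i of the result only ever contains draws made with the i-th mean and sd.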

I can't think of a faster way than using base::rnorm on its own.

Worked example


#example data
df <- data.frame(country=c("a", "b", "c"), 
                 mean=c(1, 10, 100), 
                 sd=c(1, 2, 10))

#function which returns a matrix, and takes column vectors as arguments for mean and sd
#rows correspond to the elements of mean/sd, columns to the n replicates
normv <- function( n , mean , sd ){
    out <- rnorm( n*length(mean) , mean = mean , sd = sd )
    return( matrix( out , nrow = length(mean) , ncol = n , byrow = FALSE ) )
}

#reproducible result (note that the order of magnitude of each row matches its input mean)
set.seed(1)
normv( 5 , df$mean , df$sd )
#           [,1]      [,2]       [,3]        [,4]        [,5]
#[1,]  0.3735462  2.595281   1.487429   0.6946116   0.3787594
#[2,] 10.3672866 10.659016  11.476649  13.0235623   5.5706002
#[3,] 91.6437139 91.795316 105.757814 103.8984324 111.2493092
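To get closer to the question's setup, where score.simulate returns the mean of 100 draws, one option is to draw that mean directly: the mean of k independent normal draws is itself normal, with the same mean and standard deviation sd/sqrt(k). The sketch below assumes each row is meant to use its own score_sd. The function name simulate_means and its draws argument are hypothetical, not part of the original answer:

```r
example.data <- data.frame(country = c("a", "b", "c"),
                           score_mean = c(0.5, 0.4, 0.6),
                           score_sd = c(0.1, 0.1, 0.2))

# simulate_means and `draws` are hypothetical names for this sketch
simulate_means <- function(n, mean, sd, draws = 100) {
  # The mean of `draws` independent normals has sd / sqrt(draws)
  out <- rnorm(n * length(mean), mean = mean, sd = sd / sqrt(draws))
  matrix(out, nrow = length(mean), ncol = n)
}

set.seed(1)
sims <- simulate_means(1000, example.data$score_mean, example.data$score_sd)

# One row per country, one column per simulation
simulated.scores <- data.frame(country = example.data$country, sims)
dim(simulated.scores)
# [1]    3 1001
```

This makes a single rnorm call for the whole table instead of one call per row per iteration, which is where the time goes in the lapply/sapply version.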

OTHER TIPS

This can be done very quickly if you remember that rnorm(1, mean, sd) has the same distribution as rnorm(1)*sd + mean. Using your data frame df, you can generate sim simulations of your obs observations like this:

obs = nrow(df)
sim = 1000
mat = data.frame(matrix(rnorm(obs*sim), obs, sim) * df$sd + df$mean)

You can check that this has the desired means by using rowMeans(mat), and check the standard deviation for, say, row 1 with sd(unlist(mat[1, ])), where unlist converts the one-row data frame into a plain numeric vector.
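A sketch of those checks, assuming the df from the worked example above. It keeps the result as a matrix (called m here, a name chosen for this illustration), so the per-row summaries work for every row at once:

```r
df <- data.frame(country = c("a", "b", "c"),
                 mean = c(1, 10, 100),
                 sd = c(1, 2, 10))

obs <- nrow(df)
sim <- 1000

set.seed(1)
# Multiplying a matrix by a vector recycles down the columns,
# so row i is scaled by df$sd[i] and shifted by df$mean[i]
m <- matrix(rnorm(obs * sim), obs, sim) * df$sd + df$mean

rowMeans(m)      # should be close to df$mean
apply(m, 1, sd)  # should be close to df$sd
```

Wrapping m in data.frame() afterwards gives the same layout as mat.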

Licensed under: CC-BY-SA with attribution