Question

I'm subsampling rows from a dataframe with columns c("x","y","density") over a range of c("s_size","reps") values, where reps is the number of replicates and s_size is the number of rows subsampled from the whole dataframe.

> head(data_xyz)
   x y density
1  6 1       0
2  7 1   17600
3  8 1   11200
4 12 1   14400
5 13 1       0
6 14 1    8000
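
For reference, a minimal stand-in with the same structure can be built like this (it only mirrors the six rows shown above; the real dataset, linked below, is much larger):

    # Hypothetical stand-in for data_xyz, matching the head() output above;
    # the full dataset (see the density.csv link below) has many more rows.
    data_xyz <- data.frame(
      x       = c(6, 7, 8, 12, 13, 14),
      y       = rep(1, 6),
      density = c(0, 17600, 11200, 14400, 0, 8000)
    )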



    # Subsampling ###################
    subsample_loop <- function(s_size, reps, int) {
      tm1 <- system.time({ # start timer
        subsample_bound <- data.frame()
        # Perform subsampling; note that the loop variables below shadow
        # the s_size and reps arguments once each sequence is evaluated
        for (s_size in seq(1, s_size, int)) {
          for (reps in 1:reps) {
            subsample <- sample.df.rows(s_size, data_xyz)
            assign(paste("sample", "_", "n", s_size, "_", "r", reps, sep = ""), subsample)
            subsample_replicate <- subsample[, ] # temporary variable
            subsample_replicate <- cbind(subsample,
                                         rep(s_size, nrow(subsample_replicate)),
                                         rep(reps, nrow(subsample_replicate)))
            subsample_bound <- rbind(subsample_bound, subsample_replicate)
          }
        }
      }) # end timer
      colnames(subsample_bound) <- c("x", "y", "density", "s_size", "reps")
      subsample_bound
    } # end function

Here's the function call:

    source("R/functions.R")
    subsample_data <- subsample_loop(s_size=206, reps=5, int=10)

Here's the row subsample function:

    # Samples N rows from a dataframe, returning a dataframe with the same columns
    # df: data frame to sample from
    # N:  number of rows to sample
    sample.df.rows <- function(N, df, ...) {
      df[sample(nrow(df), N, replace = FALSE, ...), ]
    }
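
For example (a minimal usage sketch; the rows drawn are random, so the seed is only for reproducibility):

    set.seed(1)                  # assumed seed, for a reproducible draw
    sample.df.rows(3, data_xyz)  # returns 3 distinct rows of data_xyz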

It's way too slow; I've tried a few times with apply functions and had no luck. I'll be doing somewhere around 1,000-10,000 replicates for each s_size from 1 to 250.

Let me know what you think! Thanks in advance.

=========================================================================

UPDATE EDIT: Sample data from which to sample: https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv

Joran's code in a function (in a sourced function.R file):

foo <- function(i,j,data){
  res <- data[sample(nrow(data),i,replace = FALSE),]
  res$s_size <- i
  res$reps <- rep(j,i)
  res
}
resampling_custom <- function(dat, s_size, int, reps) {
  ss <- rep(seq(1,s_size,by = int),each = reps)
  id <- rep(seq_len(reps),times = s_size/int)
  out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
}

Calling the function

set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)

This outputs data, unfortunately with the following warning message:

Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
  longer argument not a multiple of length of shorter
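
The warning arises because ss and id end up with different lengths: seq(1, 206, by = 5) has 42 elements, so ss has length 42 * reps, while s_size/int = 41.2 is truncated to 41 when building id. One possible fix (a sketch of my own, not from the original answer) is to derive both vectors from the same sequence:

# Sketch of a fix: compute the size sequence once so ss and id always
# have matching lengths, even when s_size is not a multiple of int.
resampling_custom <- function(dat, s_size, int, reps) {
  sizes <- seq(1, s_size, by = int)
  ss <- rep(sizes, each = reps)                    # subsample size per draw
  id <- rep(seq_len(reps), times = length(sizes))  # replicate id per draw
  do.call(rbind, mapply(foo, i = ss, j = id,
                        MoreArgs = list(data = dat), SIMPLIFY = FALSE))
}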

Solution

I put very little thought into actually optimizing this; I was just concentrating on doing something that's at least reasonable while matching your procedure.

Your big problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and then expand that object using rbind, cbind, or c, you can be very sure that the resulting code will be essentially the slowest possible way of doing whatever task is being attempted.
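
To see why, compare the two patterns on a toy example (a minimal sketch of my own, not from the original answer): growing a data frame re-copies everything accumulated so far on every iteration, while building a list of pieces and binding once copies only at the end.

# Slow pattern: the accumulated data frame is copied on every iteration
grow_rbind <- function(n) {
  out <- data.frame()
  for (i in seq_len(n)) {
    out <- rbind(out, data.frame(i = i, sq = i^2))
  }
  out
}

# Fast pattern: build the pieces in a list, then bind them once
bind_once <- function(n) {
  pieces <- lapply(seq_len(n), function(i) data.frame(i = i, sq = i^2))
  do.call(rbind, pieces)
}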

This version is around 12-13 times faster, and I'm sure you could squeeze some more out of this if you put some real thought into it:

s_size <- 200
int <- 10
reps <- 30

# one entry per subsample to draw: each size repeated reps times,
# paired with a replicate id
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)

# draw one subsample of i rows and tag it with its size and replicate id
foo <- function(i,j,data){
    res <- data[sample(nrow(data),i,replace = FALSE),]
    res$s_size <- i
    res$reps <- rep(j,i)
    res
}

# build all subsamples as a list, then bind them once at the end
# (dat is the full data frame being sampled from)
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))

The best part about R is that not only is this way, way faster, it's also way less code.
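
As a rough check on your own data (a hedged sketch; data_xyz stands in for the question's full dataset, and exact timings will vary by machine and data size):

# Timing comparison sketch: both calls draw the same set of subsamples.
# Assumes data_xyz (the question's full dataset) is loaded and that ss,
# id, and foo from above are defined.
system.time(slow <- subsample_loop(s_size = 200, reps = 30, int = 10))
system.time(fast <- do.call(rbind, mapply(foo, i = ss, j = id,
                            MoreArgs = list(data = data_xyz), SIMPLIFY = FALSE)))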

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow