Bootstrap two-way large dataset [duplicate]

https://stackoverflow.com/questions/13109383

14-07-2021
|

Question

Possible Duplicate:
Bootstrap a large data set

I would like to bootstrap a large two-way data set which contains multiple column and row variables. I have to preserve both row and column variables. The result should be a list containing a bootstrap of all column variables for each row variable. I am providing the required code to answer my question but i think it is not elegant. I would appreciate a better and faster code. The following is a simplified re-creation of the two-way data set:

rm(list=ls())

data <- 1:72

Create a two way matrix data:

charDataDiff <- matrix(data, nrow=9,ncol=8)
varNames <- c("A", "B", "C","A", "B", "C","A", "B", "C")

Add a character column to the charDataDiff matrix:

charDataDiff <- cbind(varNames ,data.frame(charDataDiff))

Add column names:

colnames(charDataDiff) <- c("patchId","s380","s390","s400","s410","s420","s430","s440","s450")

Separate the data using the row-variable "patchId" as the criteria. This creates three lists: one for each Variable

idColor <-  c("A", "B", "C")

(patchSpectrum <- lapply(idColor, function(idColor) charDataDiff[charDataDiff$patchId==idColor,]))

Created the function sampleBoot to sample the patchSpectrum

sampleBoot <-  function(nbootstrap=2, patch=3){
    return(lapply(1:nbootstrap, function(i)
             {patchSpectrum[[patch]][sample(1:nrow(patchSpectrum[[patch]]),replace=TRUE),]}))}

The list "k" answers my question. However, I think my code is slow for a large data set and large bootstrap. I am only bootstrapping 10 iteration for three row variables. A faster more elegant code is appreciated.

numBoots <- 10
for (i in 1: numBoots)
        k <- lapply(1:3, function(n)
                    do.call(rbind, lapply(sampleBoot(i, n), function(x) apply(x[-1], 2, median))))
k

Solution

What I can provide is a clean rewrite of your code:

I broke up long pieces into appropriately named functions, so it reads better,
I got rid of the varNames column, instead used by to split your data as @KenWilliams had suggested in your previous question,
I made use of replicate instead of lapply,
I got rid of the unnecessary for loop I pointed out in my comment.

It should run faster that way; if not I'd suggest you try profiling it.

charDataDiff <- matrix(1:72, nrow = 9, ncol = 8)
colnames(charDataDiff) <- c("s380", "s390", "s400", "s410",
                            "s420", "s430", "s440", "s450")

varNames      <- c("A", "B", "C", "A", "B", "C", "A", "B", "C")
patchSpectrum <- by(charDataDiff, varNames, data.frame)

sampleOne   <- function(x) x[sample(seq_len(nrow(x)), replace = TRUE), ]
sampleBoot  <- function(x, n) replicate(n, sampleOne(x), simplify = FALSE)
applyMedian <- function(l) do.call(rbind, lapply(l, apply, 2, median))

k <- lapply(lapply(patchSpectrum, sampleBoot, n = 10), applyMedian)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow