Pergunta

I want to bootstrap a data set that has groups in it. A simple scenario would be bootstrapping simple means:

data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2)), by = "group"]}
boot(data, stat, R = 10)

This gives me the error incorrect number of subscripts on matrix, because of by = "group" part. I managed to solve it using subsetting, but don't like this solution. Is there simpler way to make this kind of task work?

In particular, I'd like to introduce an additional argument in the statistics function like stat(x, i, groupvar) and pass it to the boot function like boot(data, stat(groupvar = group), R = 100)?

Foi útil?

Solução 2

This should do it:

data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

Outras dicas

Using

 boot       * 1.3-18  2016-02-23 CRAN (R 3.2.3)                        
 data.table * 1.9.7   2015-10-05 Github (Rdatatable/data.table@d607425)

I received an error using the OP's code with the answer supplied by @eddi:

data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2)), by = "group"]}
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

Produces the error message:

Error in eval(expr, envir, enclos) : object 'group' not found 

The error is fixed by removing by=group from the function stat:

set.seed(1000)
data <- as.data.table(list(x1 = runif(200), x2 = runif(200), group = runif(200)>0.5))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
data[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

Which produces the following Bootstrap Statistics results:

[[1]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
     original       bias    std. error
t1* 0.5158232  0.004930451  0.01576641
t2* 0.5240713 -0.001851889  0.02851483

[[2]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
     original        bias    std. error
t1* 0.5142383 -0.0072475030  0.02568692
t2* 0.5291694 -0.0001509404  0.02378447

Below, I modify the sample dataset to highlight which Bootstrap Statistic goes with which group-column combination:

Consider group 1 which has a mean value of 10 for x1 and a mean value of 10000 for x2 and group 2 which has a mean value of 2000 for x1 and a mean value of 8000 for x2:

data2 <- as.data.table(list(x1 = c(runif(100, 9,11),runif(100, 1999,2001)), x2 = c(runif(100, 9999,10001),runif(100, 7999,8001)), group = rep(c(1,2), each=100)))
stat <- function(x, i) {x[i, c(m1 = mean(x1), m2 = mean(x2))]}
data2[, list(list(boot(.SD, stat, R = 10))), by = group]$V1

Which gives:

[[1]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
      original       bias    std. error
t1*   10.00907  0.007115938  0.04349184
t2* 9999.90176 -0.019569568  0.06160653

[[2]]

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = .SD, statistic = stat, R = 10)


Bootstrap Statistics :
    original       bias    std. error
t1* 1999.965  0.031694179  0.06561209
t2* 8000.110 -0.006569872  0.03992401

Lots of problems in your code before you even get to the by group part.

Did you mean something like this?

data <- as.data.frame(list(x1 = runif(200), x2 = runif(200), group = factor(sample(letters[1:2]))))
stat <- function(x, i)  c(m1 = mean(x$x1[i]), m2 = mean(x$x2[i]))

> stat(x,1:10)
       m1        m2 
0.4465738 0.5522221 

Then from there you can worry about doing it by group however you choose to.

For instance:

library(plyr)
dlply( data, .(group), function( dat ) boot(dat, stat, R=10) )

For bigger datasets, try data.table:

by( seq(nrow(data)), data$group, function(idx) myboot(data[idx,]))

I went with by() rather than the data.table's ,by= argument because you want the output to be a list. There may be some functionality I don't know about for doing that, but I couldn't find it (see the edit history for the problem it was causing).

The subsetting is still done via the data.table's [] method, so it should be plenty fast.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top