Question

I am producing a script for creating bootstrap samples from the cats dataset (from the -MASS- package).

Following the Davison and Hinkley textbook [1], I ran a simple linear regression and adopted a fundamental non-parametric procedure for bootstrapping from iid observations, namely pairs resampling.

The original sample is in the form:

Bwt   Hwt
2.0   7.0
2.1   7.2
...
1.9   6.8

Through a univariate linear model we want to explain the cats' heart weight (Hwt) through their body weight (Bwt).

The code is:

library(MASS)
library(boot)


##################
#   CATS MODEL   #
##################

cats.lm <- glm(Hwt ~ Bwt, data=cats)
cats.diag <- glm.diag.plots(cats.lm, ret=T)


#######################
#   CASE resampling   #
#######################

cats.fit <- function(data) coef(glm(data$Hwt ~ data$Bwt)) 
statistic.coef <- function(data, i) cats.fit(data[i,]) 

bootl <- boot(data=cats, statistic=statistic.coef, R=999)
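
Percentile or BCa confidence intervals for the slope can then be obtained from this object, for example:

# e.g. percentile and BCa intervals for the slope (the second coefficient)
boot.ci(bootl, index=2, type=c("perc", "bca"))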

Suppose now that there exists a clustering variable cluster = 1, 2,..., 24 (for instance, each cat belongs to a given litter). For simplicity, suppose that data are balanced: we have 6 observations for each cluster. Hence, each of the 24 litters is made up of 6 cats (i.e. n_cluster = 6 and n = 144).

It is possible to create a fake cluster variable through:

q <- rep(1:24, times=6)
cluster <- sample(q)
c.data <- cbind(cats, cluster)

I have two related questions:

How can I simulate samples in accordance with the (clustered) dataset structure? That is, how can I resample at the cluster level? I would like to sample the clusters with replacement and keep the observations within each selected cluster as in the original dataset (i.e. sampling with replacement the clusters and without replacement the observations within each cluster).

This is the strategy proposed by Davison and Hinkley (p. 100). Suppose we draw B = 100 samples. Each of them should be composed of 24 possibly recurrent clusters (e.g. cluster = 3, 3, 1, 4, 12, 11, 12, 5, 6, 8, 17, 19, 10, 9, 7, 7, 16, 18, 24, 23, 11, 15, 20, 1), and each cluster should contain the same 6 observations as in the original dataset. How can I do that in R (either with or without the -boot- package)? Do you have alternative suggestions for proceeding?

The second question concerns the initial regression model. Suppose I adopt a fixed-effects model, with cluster-level intercepts. Does it change the resampling procedure adopted?

[1] Davison, A. C., Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press.


Solution

If I understand you correctly, this is what you are trying to do with c.data as input:

  • Resample clusters with replacement
  • Maintain the association between each cluster in the random sample and its points from the original data set (i.e. c.data)
  • Create a bootstrap replicate from the sampled clusters

Here is a script that achieves this, which you can wrap into a function to repeat R times, where R is the number of bootstrap replicates:

q <- rep(1:24, times=6)
cluster <- sample(q)
c.data <- cbind(cats, cluster)

# get a vector with all clusters
c <- sort(unique(c.data$cluster))

# group the data points per cluster
clust.group <- function(c) {
    c.data[c.data$cluster==c,]
}

clust.list <- lapply(c,clust.group)

# resample clusters with replacement
c.sample <- sample(c, replace=T)

clust.sample <- clust.list[c.sample]

clust.size <- 6

# combine the cluster list back into a single data matrix
# (note: this coerces factor columns such as Sex to their numeric codes)
clust.bind <- function(c) {
    matrix(unlist(c),nrow=clust.size)
}

c.boot <- do.call(rbind,lapply(clust.sample,clust.bind))

# Just to keep the column names
colnames(c.boot) <- names(c.data)

# the new data set (single bootstrap replicate)
c.boot
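
To repeat this R times you can wrap the resampling step into a function, for instance along these lines (the helper below refits the cats regression on each replicate and returns its coefficients; rbind-ing the data frames directly keeps the column types):

one.rep <- function() {
    # draw 24 clusters with replacement and stack their original rows
    c.sample <- sample(c, replace=TRUE)
    clust.sample <- clust.list[c.sample]
    c.boot <- do.call(rbind, clust.sample)
    # refit the model on the bootstrap replicate and return the coefficients
    coef(glm(Hwt ~ Bwt, data=c.boot))
}

R <- 999
boot.coefs <- t(replicate(R, one.rep()))   # R x 2 matrix of bootstrap coefficients
apply(boot.coefs, 2, sd)                   # bootstrap standard errors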

Other tips

I tried to solve the problem in the following way. Although it works, it can probably be improved in terms of speed and "elegance". Also, if possible I would have preferred to find a way of using the -boot- package, as it allows one to automatically compute a number of bootstrapped confidence intervals through boot.ci...

For simplicity, the starting dataset consists of 18 cats (the "lower-level" observations) nested in 6 laboratories (the clustering variable). The dataset is balanced (n_cluster = 3 for each cluster). We have one regressor, x, for explaining y.

The fake dataset and the matrix in which to store the results are:

  # fake sample 
  dat <- expand.grid(cat=factor(1:3), lab=factor(1:6))
  dat <- cbind(dat, x=runif(18), y=runif(18, 2, 5))

  # empty matrix for storing coefficients estimates and standard errors of x
  B <- 50 # number of bootstrap samples
  b.sample <- matrix(nrow=B, ncol=3, dimnames=list(c(), c("sim", "b_x", "se_x")))
  b.sample[,1] <- rep(1:B)

At each of the B iterations, the following loop samples 6 clusters with replacement, each composed of 3 cats sampled without replacement (i.e. the clusters' internal composition is kept unaltered). The estimates of the regressor coefficient and of its standard error are stored in the previously created matrix:

  ####################################
  #   loop through "b.sample" rows   #
  ####################################

  for (i in seq_len(B)) {

  ###   sampling with replacement from the clustering variable   

    # sampling with replacement from the clustering variable "lab"
    cls <- sample(unique(dat$lab), replace=TRUE)
    cls.col <- data.frame(lab=cls)

    # reconstructing the overall simulated sample
    cls.resample <- merge(cls.col, dat, by="lab")


  ###   fitting linear model to simulated data    

    # model fit on the simulated sample
    mod.fit <- glm(y ~ x, data=cls.resample)

    # estimated coefficient and standard error of x
    b_x <- summary(mod.fit)$coefficients[2,1]
    se_x <- summary(mod.fit)$coefficients[2,2]

    b.sample[i,2] <- b_x
    b.sample[i,3] <- se_x

  }

The final bootstrapped standard error is:

 boot_se_x <- sum(b.sample[,3])/(B-1) 
 boot_se_x
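
(For comparison, the standard deviation of the replicated coefficient estimates, which is the more usual definition of the bootstrap standard error, would be something like:)

 sd(b.sample[, "b_x"])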

Is there any way of using the -boot- package to do this?
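
One route I can think of (just a sketch using the fake dat above; the names clusters, cluster.coef and cl.boot are only illustrative) is to pass the vector of cluster labels to boot() and rebuild the stacked data set inside the statistic:

  # resample cluster labels; the statistic rebuilds the data set and refits the model
  clusters <- as.character(unique(dat$lab))

  cluster.coef <- function(cl, i) {
      # i indexes the resampled cluster labels
      resampled <- do.call(rbind, lapply(cl[i], function(l) dat[dat$lab == l, ]))
      coef(glm(y ~ x, data=resampled))[2]
  }

  cl.boot <- boot(data=clusters, statistic=cluster.coef, R=999)
  sd(cl.boot$t[, 1])               # bootstrap standard error of the x coefficient
  boot.ci(cl.boot, type="perc")    # percentile confidence interval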

Also, with respect to adopting fixed effects at the cluster level instead of the simple linear regression, my main doubt is that in some simulated samples some of the initial clusters may not be selected, so that the corresponding cluster-specific intercepts cannot be identified. Looking at the code I posted, this should not be a problem from a "mechanical" point of view (at each iteration we can fit a different FE model with just the sampled clusters' intercepts, as sketched below).
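
Mechanically, the per-iteration fixed-effects fit would then be something like (inside the loop above; unused cluster levels are dropped when the model frame is built):

  # cluster-specific intercepts for the sampled clusters only
  fe.fit <- glm(y ~ x + factor(lab), data=cls.resample)
  summary(fe.fit)$coefficients["x", 1:2]   # slope estimate and its standard error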

I was wondering whether there is a statistical issue in all of this.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow