Question

I am producing a script for creating bootstrap samples from the cats dataset (from the -MASS- package).

Following the Davison and Hinkley textbook [1], I ran a simple linear regression and adopted a fundamental non-parametric procedure for bootstrapping from iid observations, namely pairs resampling.

The original sample is in the form:

Bwt   Hwt
2.0   7.0
2.1   7.2
...
1.9   6.8

Through a univariate linear model we want to explain the cats' heart weight (Hwt) through their body weight (Bwt).

The code is:

library(MASS)
library(boot)


##################
#   CATS MODEL   #
##################

cats.lm <- glm(Hwt ~ Bwt, data=cats)
cats.diag <- glm.diag.plots(cats.lm, ret=T)


#######################
#   CASE resampling   #
#######################

cats.fit <- function(data) coef(glm(data$Hwt ~ data$Bwt)) 
statistic.coef <- function(data, i) cats.fit(data[i,]) 

bootl <- boot(data=cats, statistic=statistic.coef, R=999)
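
Percentile or BCa confidence intervals for the slope can then be obtained from this object, for example:

# e.g. percentile and BCa intervals for the slope (the second coefficient)
boot.ci(bootl, index=2, type=c("perc", "bca"))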

Suppose now that there exists a clustering variable cluster = 1, 2,..., 24 (for instance, each cat belongs to a given litter). For simplicity, suppose that data are balanced: we have 6 observations for each cluster. Hence, each of the 24 litters is made up of 6 cats (i.e. n_cluster = 6 and n = 144).

It is possible to create a fake cluster variable through:

q <- rep(1:24, times=6)
cluster <- sample(q)
c.data <- cbind(cats, cluster)

I have two related questions:

How can I simulate samples in accordance with the (clustered) dataset structure? That is, how can I resample at the cluster level? I would like to sample the clusters with replacement and keep the observations within each selected cluster as in the original dataset (i.e. sampling with replacement the clusters and without replacement the observations within each cluster).

This is the strategy proposed by Davison and Hinkley (p. 100). Suppose we draw B = 100 samples. Each of them should be composed of 24 possibly recurrent clusters (e.g. cluster = 3, 3, 1, 4, 12, 11, 12, 5, 6, 8, 17, 19, 10, 9, 7, 7, 16, 18, 24, 23, 11, 15, 20, 1), and each cluster should contain the same 6 observations as in the original dataset. How can I do that in R (either with or without the -boot- package)? Do you have alternative suggestions for proceeding?

The second question concerns the initial regression model. Suppose I adopt a fixed-effects model, with cluster-level intercepts. Does it change the resampling procedure adopted?

[1] Davison, A. C., Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge University Press.


Solution

If I understand you correctly, this is what you are trying to do with c.data as input:

  • Resample clusters with replacement
  • Maintain the association between each cluster in the random sample and its points from the original data set (i.e. c.data)
  • Create a bootstrap replicate from the sampled clusters

Here is a script that achieves this, which you can wrap into a function to repeat R times, where R is the number of bootstrap replicates:

q <- rep(1:24, times=6)
cluster <- sample(q)
c.data <- cbind(cats, cluster)

# get a vector with all clusters
c <- sort(unique(c.data$cluster))

# group the data points per cluster
clust.group <- function(c) {
    c.data[c.data$cluster==c,]
}

clust.list <- lapply(c,clust.group)

# resample clusters with replacement
c.sample <- sample(c, replace=T)

clust.sample <- clust.list[c.sample]

clust.size <- 6

# combine the cluster list back into a single data matrix
# (note: this coerces factor columns such as Sex to their numeric codes)
clust.bind <- function(c) {
    matrix(unlist(c),nrow=clust.size)
}

c.boot <- do.call(rbind,lapply(clust.sample,clust.bind))

# Just to keep the column names
colnames(c.boot) <- names(c.data)

# the new data set (single bootstrap replicate)
c.boot
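
To repeat this R times you can wrap the resampling step into a function, for instance along these lines (the helper below refits the cats regression on each replicate and returns its coefficients; rbind-ing the data frames directly keeps the column types):

one.rep <- function() {
    # draw 24 clusters with replacement and stack their original rows
    c.sample <- sample(c, replace=TRUE)
    clust.sample <- clust.list[c.sample]
    c.boot <- do.call(rbind, clust.sample)
    # refit the model on the bootstrap replicate and return the coefficients
    coef(glm(Hwt ~ Bwt, data=c.boot))
}

R <- 999
boot.coefs <- t(replicate(R, one.rep()))   # R x 2 matrix of bootstrap coefficients
apply(boot.coefs, 2, sd)                   # bootstrap standard errors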

Other tips

I tried to solve the problem in the following way. Although it works, it can probably be improved in terms of speed and "elegance". Also, if possible I would have preferred to find a way of using the -boot- package, as it allows one to automatically compute a number of bootstrapped confidence intervals through boot.ci...

For simplicity, the starting dataset consists of 18 cats (the "lower-level" observations) nested in 6 laboratories (the clustering variable). The dataset is balanced (n_cluster = 3 for each cluster). We have one regressor, x, for explaining y.

The fake dataset and the matrix in which to store the results are:

  # fake sample 
  dat <- expand.grid(cat=factor(1:3), lab=factor(1:6))
  dat <- cbind(dat, x=runif(18), y=runif(18, 2, 5))

  # empty matrix for storing coefficients estimates and standard errors of x
  B <- 50 # number of bootstrap samples
  b.sample <- matrix(nrow=B, ncol=3, dimnames=list(c(), c("sim", "b_x", "se_x")))
  b.sample[,1] <- rep(1:B)

At each of the B iterations, the following loop samples 6 clusters with replacement, each composed of 3 cats sampled without replacement (i.e. the clusters' internal composition is kept unaltered). The estimates of the regressor coefficient and of its standard error are stored in the previously created matrix:

  ####################################
  #   loop through "b.sample" rows   #
  ####################################

  for (i in seq_len(B)) {

  ###   sampling with replacement from the clustering variable   

    # sampling with replacement from the clustering variable "lab"
    cls <- sample(unique(dat$lab), replace=TRUE)
    cls.col <- data.frame(lab=cls)

    # reconstructing the overall simulated sample
    cls.resample <- merge(cls.col, dat, by="lab")


  ###   fitting linear model to simulated data    

    # model fit on the simulated sample
    mod.fit <- glm(y ~ x, data=cls.resample)

    # estimated coefficient and standard error of x
    b_x <- summary(mod.fit)$coefficients[2,1]
    se_x <- summary(mod.fit)$coefficients[2,2]

    b.sample[i,2] <- b_x
    b.sample[i,3] <- se_x

  }

The final bootstrapped standard error is:

 boot_se_x <- sum(b.sample[,3])/(B-1) 
 boot_se_x
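
(For comparison, the standard deviation of the replicated coefficient estimates, which is the more usual definition of the bootstrap standard error, would be something like:)

 sd(b.sample[, "b_x"])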

Is there any way of using the -boot- package to do this?
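
One route I can think of (just a sketch using the fake dat above; the names clusters, cluster.coef and cl.boot are only illustrative) is to pass the vector of cluster labels to boot() and rebuild the stacked data set inside the statistic:

  # resample cluster labels; the statistic rebuilds the data set and refits the model
  clusters <- as.character(unique(dat$lab))

  cluster.coef <- function(cl, i) {
      # i indexes the resampled cluster labels
      resampled <- do.call(rbind, lapply(cl[i], function(l) dat[dat$lab == l, ]))
      coef(glm(y ~ x, data=resampled))[2]
  }

  cl.boot <- boot(data=clusters, statistic=cluster.coef, R=999)
  sd(cl.boot$t[, 1])               # bootstrap standard error of the x coefficient
  boot.ci(cl.boot, type="perc")    # percentile confidence interval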

Also, with respect to adopting fixed effects at the cluster level instead of the simple linear regression, my main doubt is that in some simulated samples some of the initial clusters may not be selected, so that the corresponding cluster-specific intercepts cannot be identified. Looking at the code I posted, this should not be a problem from a "mechanical" point of view (at each iteration we can fit a different FE model with just the sampled clusters' intercepts, as sketched below).
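
Mechanically, the per-iteration fixed-effects fit would then be something like (inside the loop above; unused cluster levels are dropped when the model frame is built):

  # cluster-specific intercepts for the sampled clusters only
  fe.fit <- glm(y ~ x + factor(lab), data=cls.resample)
  summary(fe.fit)$coefficients["x", 1:2]   # slope estimate and its standard error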

I was wondering whether there is a statistical issue in all of this.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow