I tried to solve the problem in the following way. Although it works, it can probably be improved in terms of speed and "elegance". Also, if possible, I would have preferred to find a way to use the -boot- package, as it allows one to automatically compute a number of bootstrapped confidence intervals through boot.ci.
...
For simplicity, the starting dataset consists of 18 cats (the "lower-level" observations) nested in 6 laboratories (the clustering variable). The dataset is balanced (n_cluster = 3 for each cluster). We have one regressor, x, for explaining y.
The fake dataset and the matrix in which to store the results are:
# fake sample
set.seed(42)  # added for reproducibility; any seed will do
dat <- expand.grid(cat=factor(1:3), lab=factor(1:6))
dat <- cbind(dat, x=runif(18), y=runif(18, 2, 5))
# empty matrix for storing the coefficient estimates and standard errors of x
B <- 50 # number of bootstrap samples
b.sample <- matrix(nrow=B, ncol=3, dimnames=list(c(), c("sim", "b_x", "se_x")))
b.sample[,1] <- 1:B
At each of the B iterations, the following loop samples 6 clusters with replacement, each composed of its 3 cats (i.e. the clusters' internal composition is kept unaltered). The estimates of the regressor coefficient and of its standard error are stored in the matrix created above:
####################################
# loop through "b.sample" rows #
####################################
for (i in seq_len(B)) {
  ### sampling with replacement from the clustering variable
  # sampling with replacement from "lab"
  cls <- sample(unique(dat$lab), replace=TRUE)
  cls.col <- data.frame(lab=cls)
  # reconstructing the overall simulated sample (a lab drawn k times
  # contributes its 3 cats k times)
  cls.resample <- merge(cls.col, dat, by="lab")
  ### fitting the linear model to the simulated data (once per iteration)
  mod.fit <- glm(y ~ x, data=cls.resample)
  # estimated coefficient and standard error of x
  b.sample[i, 2] <- summary(mod.fit)$coefficients[2, 1]
  b.sample[i, 3] <- summary(mod.fit)$coefficients[2, 2]
}
The final bootstrapped standard error is the standard deviation of the B stored coefficient estimates (not an average of the per-sample standard errors):
boot_se_x <- sd(b.sample[, "b_x"])  # i.e. sqrt(sum((b_x - mean(b_x))^2)/(B-1))
boot_se_x
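For what it's worth, a crude 95% percentile interval can already be read off the stored estimates (a manual stand-in for what boot.ci would automate):
quantile(b.sample[, "b_x"], probs=c(0.025, 0.975))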
Is there any way to adopt the -boot- package for doing this?
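Something along these lines is what I have in mind, although I am not sure it is the intended use of boot (a rough sketch; cluster.boot is just an illustrative name, and I pass the vector of cluster IDs as the "data" so that boot resamples clusters rather than cats):
library(boot)

labs <- unique(dat$lab)  # the 6 cluster IDs

# statistic for boot: "idx" indexes clusters; the cat-level sample is
# rebuilt by merging the drawn clusters back onto "dat"
cluster.boot <- function(cl, idx) {
  resample <- merge(data.frame(lab=cl[idx]), dat, by="lab")
  coef(glm(y ~ x, data=resample))["x"]
}

b.out <- boot(labs, cluster.boot, R=999)  # more replications than B above
sd(b.out$t[, 1])                          # bootstrapped SE of b_x
boot.ci(b.out, type="perc")               # percentile CI via boot.ci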
Also, with respect to adopting fixed effects at the cluster level instead of the simple linear regression, my main doubt is that some simulated samples may fail to include some of the initial clusters, so that the related cluster-specific intercepts cannot be identified. Looking at the code I posted, this should not be a problem from a "mechanical" point of view (at each iteration we can fit a different FE model with just the sampled clusters' intercepts, as sketched below).
I was wondering whether there is a statistical issue in all of this.
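For concreteness, the per-iteration FE fit I have in mind would simply replace the plain glm call inside the loop (a sketch; fe.fit is an illustrative name):
# drop the factor levels of the labs not drawn in this iteration
cls.resample$lab <- droplevels(cls.resample$lab)
# cluster-specific intercepts only for the sampled labs
fe.fit <- glm(y ~ x + lab, data=cls.resample)
summary(fe.fit)$coefficients["x", 1:2]  # estimate and SE of x
# note: with the merge-based resampling, a lab drawn twice collapses into a
# single level, so its two "copies" share one intercept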