How to get bootstrapped p-values and bootstrapped t-values and how does the function boot() work?

https://stackoverflow.com/questions/8273812

09-03-2021
|

Question

I would like to get the bootstrapped t-value and the bootstrapped p-value of a lm. I have the following code (basically copied from a paper) which works.

# First of all you need the following packages
install.packages("car") 
install.packages("MASS")
install.packages("boot")
library("car")
library("MASS")
library("boot")

boot.function <- function(data, indices){
data <- data[indices,]
mod <- lm(prestige ~ income + education, data=data) # the liear model

# the first element of the following vector contains the t-value
# and the second element is the p-value
c(summary(mod)[["coefficients"]][2,3], summary(mod)[["coefficients"]][2,4])     
}

Now, I compute the bootstrapping model, which gives me the following:

duncan.boot <- boot(Duncan, boot.function, 1999)
duncan.boot

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Duncan, statistic = boot.function, R = 1999)


Bootstrap Statistics :
        original      bias    std. error
t1* 5.003310e+00 0.288746545  1.71684664
t2* 1.053184e-05 0.002701685  0.01642399

I have two questions:

My understanding is that the bootsrapped value is the original plus the bias, which means that both bootstrapped values (the bootstrapped t-value as well as the bootstrapped p-value) are greater than the original values. This in turn is not possible, because if the t-value rises (which means more significance) the p-values MUST be lower, right? Therefore I think that I have not yet really understood the output of the boot function (here: duncan.boot). How do I compute the bootstrapped values?
I do not understand how the boot() works. If you look at duncan.boot <- boot(Duncan, boot.function, 1999) you see that I have not passed any arguments for the function "boot.function". I suppose that R sets data <- Duncan. But since I have not passed anything for the argument "indices", I do not understand how the following line in the function "boot.function" works data <- data[indices,]

I hope the questions make sense!??

Solution

The boot function is "expecting" to get a function that has two arguments: the first being a data.frame and the second being an "indices" vector (possibly with duplicate entries and probably not using all the indices) to use in selecting rows and probably having some duplicate or triplicates.) It then samples with replacement determined by the pattern of duplicates and triplicates from the original dataframe (multiple times determined by "R" with different "choice sets"), passes those to the indices argument in the boot.function, and then collects the results of the R number of function applications.

Regarding what is reported by the print method for boot objects, take a look at this (done after examining the returned object with str()

> duncan.boot$t0
[1] 5.003310e+00 1.053184e-05
> apply(duncan.boot$t, 2, mean)
[1] 5.342895220 0.002607943
> apply(duncan.boot$t, 2, mean) - duncan.boot$t0
[1] 0.339585441 0.002597411

It becomes more obvious that the T0 value is from the original data while the bias is the difference between the mean of the boot()-ed values and the T0 values. I don't think it makes a lot of sense to be asking why p-values based on parametric considerations are increasing in association with an increase in estimated t-statistics. You are really in two disparate regions of statistical thought when you do that. I would have interpreted the increase in p-values as an effect of the sampling process, which does not take into account the Normal distribution assumptions. It is simply saying something about the sampling distribution of the p-value (which is really just another sample statistic).

(Comment: The sourcebook used at the time of R development was Davison and Hinkley's "Bootstrap Methods and their Applications". I'm no claiming any support for my answer above, but I thought to put it in as a reference after Hagen Brenner asked about sampling with two indices in the comments below. There are many unexpected aspects of bootstrapping that arise after one goes beyond the simple parametric estimation and I would first turn to that reference if I were tackling more complex sampling situations.)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow