Domanda

I've just started to try out a nice bootstrapping package available through scikits: https://github.com/cgevans/scikits-bootstrap

but I've encountered a problem when trying to estimate confidence intervals for the correlation coefficient from linear regression. The confidence intervals returned lie completely outside the range of the original statistic.

Here is the code:

import numpy as np
from scipy import stats
import bootstrap as boot

np.random.seed(0)
x     = np.arange(10)
y     = 10 + 1.5*x + 2*np.random.randn(10)
r0    = stats.linregress(x, y)[2]

def my_function(y):
    return stats.linregress(x, y)[2]

ci    = boot.ci(y, statfunction=my_function, alpha=0.05, n_samples=1000, method='pi')

This yields a result of ci = [-0.605, 0.644], but the original statistic is r0=0.894.

I've tried this in R and it seems to work fine there: the ci straddles r0 as expected.

Please help!

È stato utile?

Soluzione

Could you provide your R code? I'd be interested in knowing how this is dealt with in R.

The problem here is that you're only passing y to boot.ci, but every time it runs my_function, it uses the entire, original x (note the lack of x input to my_function). Bootstrapping applies the statistic function to resampled data, so if you're applying your statistic function using the original x and a sample of y, you're going to have a nonsensical result. This is why the BCA method doesn't work at all, actually: it can't apply your statistic function to jackknife samples, which don't have the same number of elements.

Samples are taken along axis 0 (rows), so if you want to pass multiple 1D arrays to your statistic function, you can use multiple columns: xy = vstack((x,y)).T would work, and then use a statfunction that takes data from those columns:

def my_function(xysample):
    return stats.linregress(xysample[:,0], xysample[:,1])[2]

Alternatively, if you wanted to avoid messing with your data at all, you could define a function that operates on indexes, and then just pass indexes to boot.ci:

def my_function2(i):
    return stats.linregress(x[i], y[i])[2]

boot.ci(np.arange(len(x)), statfunction=my_function2, alpha=0.05, n_samples=1000, method='pi')

Note that in either of these cases, BCA works, so you may as well use method='bca' unless you really do want to use percentage intervals; BCA is pretty much always better.

I do realize that both of these methods are less than ideal. Honestly, I've never had a need to pass multiple arrays like this to my statfunction, and the majority of people are likely using mean as their statfunction. I think the best idea here may be to allow lists of equally-size[0] arrays to be passed, eg, boot.ci([x,y],...), and then sample all of those at the same time and pass them all to the statfunction as separate arguments. In that case, you could just have a my_function(x,y). I'll see if I can do this, but if you can show me your R code, that would be great, as I'd like to see if there is a better way of dealing with this.


Update:

In the most recent version of scikits.bootstrap (v0.3.1), a tuple of arrays can be provided, and samples from them will be passed as separate arguments to statfunction. Additionally, statfunction can provide array output, and confidence intervals will be calculated for each point in the output. Thus, this is now very easy to do. The following will give confidence intervals for every output of linregress:

cis = boot.ci( (x,y), statfunction=stats.linregress )

cis[:,2] in this case will be the desired confidence interval.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top