Question

I am working with hydrological time series data and am trying to build bootstrap artificial neural network models. To assess uncertainty with confidence intervals, the resampling (bootstrapping) of the original time series must ensure that every value in the original series is held back at least twice across all bootstrap samples, so that a variance and confidence interval can be calculated at that point in time.

To give some background:

I am using a hydrological time series that contains Standardized Precipitation Index values at monthly time steps. The series is 429 (rows) x 1 (column); let's call this vector X. All elements/values of X are normalized and standardized between 0 and 1.

X is then used, together with some Target values (same length and conditions as X), to train a Neural Network that produces new estimates of the Target values. We'll call this output vector O (same length and conditions as X).

I now resample X with replacement 200 times (ii = 1:1:200, i.e. Bootstrap size = 200), each sample having length 429. Let's call the matrix that holds all the bootstrap samples M. I use B = randsample(X, length(X), true) and fill M in a for loop such that M(:,ii) = B. Note: I also call rng('shuffle') after my randsample statement to keep the RNG moving to new states, in the hope that it will give more random results.

Next I test how "well" my data was resampled for use in creating confidence intervals.

My procedure is as follows:

  1. Use a for loop to create M with the procedure above.
  2. Create a new variable Xc that holds all values of X that were not resampled in bootstrap sample ii, for ii = 1:1:200.
  3. For j = 1:1:length(X), fill Xc using Xc(j,ii) = setdiff(X, M(:,ii)); if element j exists in M(:,ii), fill Xc(j,ii) with NaN.
  4. Xc is now a matrix with the same size and dimensions as M. Count the number of NaN values in each row of Xc and place the counts in vector CI.
  5. If any entry of CI is > [Bootstrap sample size - 1] (here 200 - 1), then no confidence interval can be created at that point.

When I run this I find that the values chosen from my set X are almost always repeated, i.e. the same values of X are used to generate all the samples in M. It's roughly the same ~200 data points in my original time series that are always chosen to create the new bootstrap samples.

How can I alter my program, or which functions can I use, to avoid the negative outcome in step 5?

Here is an example of my code, but please keep in mind the variables used in the script may differ from my text in here.

Thank you for the help and please see the code below.

for ii = 1:1:Blen % for loop to create 'how many bootstraps we desire'
    B = randsample(Xtrain, wtrain, true); % bootstrap resamples of data series 'X' for 'how many elements' with replacement
    rng('shuffle');
    M(:,ii) = B; % creates a matrix of all bootstrap resamples with respect to the amount created by the for loop
    [C,IA] = setdiff(Xtrain,B); % creates a vector containing all elements of 'Xtrain' that were not included in bootstrap sample 'ii' and the location of each element
    [IAc] = setdiff(k,IA); % creates a vector containing locations of elements of 'Xtrain' used in bootstrap sample 'ii' --> ***IA + IAc = wtrain***

    for j = 1:1:wtrain % for loop that counts each row of vector
        if ismember(j,IA)== 1 % if the count variable is equal to a value of 'IA'
            XC(j,ii) = Xtrain(j,1); % place variable in matrix for sample 'ii' in position 'j' if statement above is true
        else
            XC(j,ii) = NaN; % hold position with a NaN value to state that this value has been used in bootstrap sample 'ii'
        end
        dum1(:,ii) = wtrain - sum(isnan(XC(:,ii))); % dummy variable to permit transposing of 'IAs' limited by 'isnan' --> used to calculate amt of elements in IA
        dum2(:,ii) = sum(isnan(XC(:,ii))); % dummy variable to permit transposing of 'IAsc' limited by 'isnan'
        IAs = transpose(dum1) ; % variable counting amount of elements not resampled in 'M' at set 'i', ***i.e. counts 'IA' for each resample set 'i'
        IAsc = transpose(dum2) ; % variable counting amount of elements resampled in 'M' at set 'i', ***i.e. counts 'IAc' for each resample set 'i'
        chk = isnan(XC); % returns 1 in position of NaN and 0 in position of actual value
        chks = sum(chk,2); % counts how many NaNs are in each row for length of time training set
        chks_cnt = sum(chks(:)<(Blen-1)); % counts how many values of the original time series that can be provided a confidence interval, should = wtrain to provide complete CIs
    end
end

Solution

This doesn't appear to be a problem with randsample, but rather a problem in your other code somewhere. randsample does the right thing. For example:

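x = (1:10)';
nSamples = 10;
nIter = 100;
data = zeros(nSamples, nIter); % preallocate one column per bootstrap sample
for iter = 1:nIter
    data(:,iter) = randsample(x, nSamples, true); % draw with replacement
end

hist(data(:)) % this is approximately uniform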

randsample samples quite randomly...
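
One place worth checking in the question's code is the setdiff call. setdiff compares values, not positions, and returns only unique values, so any repeated values in Xtrain are collapsed and IA points at a single occurrence of each. A way around that is to resample row indices and track which rows were left out of each sample. The following is only a sketch under that assumption, not a drop-in replacement: the names idx, oob, oobCount and nUsable are illustrative, X is random stand-in data, and the generator is seeded once before the loop rather than reseeded inside it.

```matlab
% Sketch only: idx, oob, oobCount and nUsable are illustrative names.
rng(0);                          % seed once, before any resampling (fixed seed for repeatability)
n     = 429;                     % length of the series, as in the question
nBoot = 200;                     % number of bootstrap samples
X     = rand(n, 1);              % stand-in for the real series

idx = randi(n, n, nBoot);        % each column holds n row indices drawn with replacement
M   = X(idx);                    % bootstrap samples, same layout as M in the question

oob = true(n, nBoot);            % oob(j,ii) is true if row j was NOT drawn in sample ii
for ii = 1:nBoot
    oob(idx(:,ii), ii) = false;  % mark every drawn row as used in sample ii
end

oobCount = sum(oob, 2);          % number of samples that held each row back
nUsable  = sum(oobCount >= 2);   % rows held back at least twice
fprintf('%d of %d rows held back at least twice\n', nUsable, n);
fprintf('mean held-back fraction: %.3f\n', mean(oob(:)));
```

For sampling with replacement, the chance that a given row is left out of one sample is (1 - 1/n)^n, which is close to exp(-1), or about 0.37 for n = 429. So each sample should hold back roughly a third of the rows, and over 200 samples it would be very unusual for any row to be held back fewer than two times. If your own counts look very different from that, the counting step is the more likely suspect than randsample.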

Licensed under: CC-BY-SA with attribution