Question

I have a dataset; for simplicity, let's say it has 1000 samples (each is a vector).

I want to split my data into train and test sets for cross-validation, NOT randomly (1). For example, for 4-fold cross-validation I should get:

fold1: train = 1:250, test = 251:1000
fold2: train = 251:500, test = [1:250, 501:1000]
fold3: train = 501:750, test = [1:500, 751:1000]
fold4: train = 751:1000, test = 1:750

I am aware of CVPARTITION, but AFAIK it splits the data randomly, which is not what I need.

I guess I can write the code for it, but I figured there is probably a function I could use.


(1) The data is already shuffled and I need to be able to easily reproduce the experiments.

Solution

Here is a function that does it in general:

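function [test, train] = kfolds(data, k)
  % Split the rows of data into k contiguous (non-random) folds.
  % test{f} is the f-th block of rows; train{f} is all remaining rows.

  n = size(data,1);

  % Preallocate the k-by-1 output cell arrays.
  test = cell(k,1);
  train = cell(k,1);

  % Rows per test block; if k does not divide n, the last n - k*chunk
  % rows never land in a test block and stay in every training set.
  chunk = floor(n/k);

  for f = 1:k
      test{f} = data((f-1)*chunk+1:f*chunk,:);
      % For f = 1 the first slice is empty, so train{1} is data(chunk+1:end,:).
      train{f} = [data(1:(f-1)*chunk,:); data(f*chunk+1:end,:)];
  end
end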

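It's not an elegant one-liner, but it's fairly robust: it doesn't need k to be a factor of your number of samples, works on a 2D matrix, and outputs the actual sets rather than indices. Two caveats: when k does not divide the number of samples, the leftover rows at the end are never used for testing, and here test is the single contiguous chunk while train is everything else, which is the reverse of the naming in the question (swap the outputs if you want that convention).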
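
If you would rather work with indices and have every sample land in exactly one test block even when k does not divide n, one possible variation is sketched below. This is not part of the function above; the names edges, testIdx and trainIdx and the sizes n = 10, k = 4 are assumptions chosen only for illustration.

```matlab
n = 10;  k = 4;                      % illustrative sizes; k does not divide n

% Block boundaries: block f covers rows edges(f)+1 .. edges(f+1).
% floor() spreads the remainder so block sizes differ by at most one.
edges = floor((0:k) * n / k);

testIdx  = cell(k, 1);
trainIdx = cell(k, 1);
for f = 1:k
    testIdx{f}  = (edges(f)+1):edges(f+1);    % contiguous block for fold f
    trainIdx{f} = setdiff(1:n, testIdx{f});   % every other row
end
```

With these sizes the blocks have 2, 3, 2 and 3 rows. Use data(testIdx{f},:) and data(trainIdx{f},:) to pull out the rows, or swap the two roles to match the question's naming.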

OTHER TIPS

Assuming you have k*n samples that you want to divide into k folds, with n samples in train and (k-1)*n in test (in your question k = 4, n = 250).
Then

 >> foldId = kron( 1:k, ones(1,n) );

foldId gives you the index of the training fold each sample belongs to.

For fold f you can get the indices of training and test samples using

 >> trainIdx = find( foldId == f );
 >> testIdx  = find( foldId ~= f );

(You can use logical indexing instead of find to speed things up a bit.)
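
As a rough sketch of the logical-indexing variant, assuming a small made-up data matrix D (hypothetical, as are the names inFold, trainRows and testRows) and following the question's convention that the single fold is the training set:

```matlab
k = 4;  n = 3;                        % illustrative sizes
D = reshape(1:(k*n*2), k*n, 2);       % hypothetical data: k*n rows, 2 columns

foldId = kron(1:k, ones(1, n));       % [1 1 1 2 2 2 3 3 3 4 4 4]

f = 2;                                % fold to use for training
inFold = (foldId == f);               % logical mask, no find() needed

trainRows = D(inFold, :);             % rows 4..6
testRows  = D(~inFold, :);            % the remaining 9 rows
```

The mask can be reused for any other array that has one entry per sample, such as a label vector.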

To divide the dataset into k folds of length n you can use:

f = arrayfun(@(x) struct('train', x*n+(1:n), 'test', setdiff(1:n*k, x*n+(1:n))), 0:k-1);

where f is a structure array with fields train and test containing the indices for the corresponding fold. For example, for n=5, k=3, and fold 2:

>> f(2).train
ans =
     6     7     8     9    10
>> f(2).test
ans =
     1     2     3     4     5    11    12    13    14    15

You can even extract the data directly. Let's say your data is a 2D matrix D with n*k rows:

E = arrayfun( ...
    @(x) struct('train', D(x*n+(1:n), :), ...
                'test',  D(setdiff(1:n*k, x*n+(1:n)), :)), 0:k-1);

Say your data is (with n=5 and k=3 as before):

D = [(1:15).^2; (1:15).^3].';

For fold 2, E contains:

>> E(2).train
ans =
          36         216
          49         343
          64         512
          81         729
         100        1000
>> E(2).test
ans =
           1           1
           4           8
           9          27
          16          64
          25         125
         121        1331
         144        1728
         169        2197
         196        2744
         225        3375
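
To run something on each fold, you can loop over the structure array. The snippet below is only a sketch: it rebuilds E for the same example data and prints the size of each split where your own training and evaluation code would go.

```matlab
n = 5;  k = 3;                          % same sizes as the example above
D = [(1:15).^2; (1:15).^3].';           % same example data

E = arrayfun( ...
    @(x) struct('train', D(x*n+(1:n), :), ...
                'test',  D(setdiff(1:n*k, x*n+(1:n)), :)), 0:k-1);

for f = 1:numel(E)
    % Placeholder: replace this line with fitting on E(f).train
    % and scoring on E(f).test.
    fprintf('fold %d: %d train rows, %d test rows\n', ...
            f, size(E(f).train, 1), size(E(f).test, 1));
end
```

Each of the three folds here has 5 training rows and 10 test rows.
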
Licensed under: CC-BY-SA with attribution