Question

I am running 4-fold cross-validation hyperparameter tuning using sklearn's 'cross_validate' and 'KFold' functions. Assuming that my training dataset is already shuffled, should I re-shuffle the data before splitting it into folds for each iteration of hyperparameter tuning (i.e., via the shuffle argument of the KFold function)? I noticed that the outcome of the hyperparameter tuning process differs depending on whether I shuffle the data before splitting it into folds.

I assume that if the outcome depends on shuffling, then the model is not robust. Is this correct? However, this also may not be 'fair' to the model, because the result is not reproducible since the data in each fold is different every time I run cross-validation (i.e., each hyperparameter combination is evaluated on entirely different folds; for example, the training/validation data in fold #1 of the 1st tuning iteration is different from the fold #1 data of the 2nd tuning iteration).
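For illustration, a minimal sketch of the two setups being compared; the classifier, hyperparameter values, and data below are just placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data

# Option A: fixed folds -- the same 4 splits for every hyperparameter value.
fixed_cv = KFold(n_splits=4, shuffle=True, random_state=42)

# Option B: re-shuffled folds -- a new shuffle every time split() is called.
reshuffled_cv = KFold(n_splits=4, shuffle=True)  # no random_state

for max_depth in (2, 5, 10):  # hypothetical hyperparameter grid
    model = RandomForestClassifier(max_depth=max_depth, random_state=0)
    score_a = cross_validate(model, X, y, cv=fixed_cv)["test_score"].mean()
    score_b = cross_validate(model, X, y, cv=reshuffled_cv)["test_score"].mean()
    print(f"max_depth={max_depth}: fixed folds {score_a:.3f}, re-shuffled {score_b:.3f}")
```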


Solution

This is a pretty good question. Using the same fold-splits for every hyperparameter point makes it possible to overfit the hyperparameters to the data-split. However, using different fold-splits for each hyperparameter point makes the comparisons between them not (exactly) apples-to-apples.

I think setting the same folds for each hyperparameter point is better. For reference, note that sklearn's xyzSearchCV classes work that way: they take the product of search points and folds and fit on every one of those combinations. You can alleviate the overfit-to-split issue with repeated k-fold.
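A minimal sketch of that recommendation, with a placeholder classifier and grid: fix the KFold's random_state so every candidate is scored on the same splits, and optionally switch to RepeatedKFold to reduce dependence on any single split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, RepeatedKFold

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
param_grid = {"max_depth": [2, 5, 10]}                      # hypothetical grid

# Same 4 folds for every candidate: comparisons stay apples-to-apples.
cv = KFold(n_splits=4, shuffle=True, random_state=42)
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=cv)
search.fit(X, y)

# Repeated k-fold: 4 folds, re-split 5 times with different shuffles,
# which reduces the chance of overfitting to one particular split.
rcv = RepeatedKFold(n_splits=4, n_repeats=5, random_state=42)
search_repeated = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=rcv)
search_repeated.fit(X, y)

print(search.best_params_, search_repeated.best_params_)
```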

Other tips

I am running 4-fold cross-validation hyperparameter tuning using sklearn's 'cross_validate' and 'KFold' functions. Assuming that my training dataset is already shuffled, should I re-shuffle the data before splitting it into folds for each iteration of hyperparameter tuning (i.e., via the shuffle argument of the KFold function)?

No, it's not needed: the data only needs to be shuffled once, before it is split into folds.
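As a small illustration with hypothetical already-shuffled data: with shuffle=False, KFold simply slices the rows in order, so the folds are identical on every run.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # pretend these rows are already shuffled
y = np.arange(10)

cv = KFold(n_splits=4, shuffle=False)  # deterministic, reproducible folds
for train_idx, val_idx in cv.split(X):
    print(val_idx)  # same validation indices every time this script runs
```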

I assume that if the outcome depends on shuffling then the model is not robust. Is this correct?

You are right: a good model performs well on every combination of the data.

However, this also may not be 'fair' to the model because the result is not reproducible, since the data in each fold is different every time I run cross-validation (i.e., each hyperparameter combination is evaluated on entirely different folds; for example, the training/validation data in fold #1 of the 1st tuning iteration is different from the fold #1 data of the 2nd tuning iteration).

You run cross-validation to check performance on every chunk of the data; the goal is to end up with the model that generalizes best.
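One way to check that robustness, sketched here with placeholder data: score the same model on several differently-seeded 4-fold splits and see whether the mean score stays stable across them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

X, y = make_classification(n_samples=500, random_state=0)  # placeholder data
model = RandomForestClassifier(max_depth=5, random_state=0)

# Score the same model on several different 4-fold splits.
means = []
for seed in range(5):
    cv = KFold(n_splits=4, shuffle=True, random_state=seed)
    scores = cross_validate(model, X, y, cv=cv, scoring="accuracy")
    means.append(scores["test_score"].mean())

# If the model generalizes well, the spread across seeds should be small.
print(f"mean accuracy: {np.mean(means):.3f} +/- {np.std(means):.3f}")
```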

Licensed under: CC-BY-SA with attribution