Question

I wonder which type of model cross-validation to choose for a classification problem: k-fold or random sub-sampling (bootstrap sampling)?

My best guess is to use 2/3 of the data set (which is ~1,000 items) for training and 1/3 for validation.

In this case, k-fold gives only three iterations (folds), which is not enough to see a stable average error.

On the other hand, I don't like a property of random sub-sampling: some items may never be selected for training/validation, while others may be used more than once.

Classification algorithms used: random forest & logistic regression.


Solution

If you have an adequate number of samples and want to use all the data, then k-fold cross-validation is the way to go. Having ~1,500 samples seems like a lot, but whether it is adequate for k-fold cross-validation also depends on the dimensionality of the data (number of attributes and number of attribute values). For example, if each observation has 100 attributes, then 1,500 observations is low.
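For concreteness, here is a minimal sketch of stratified k-fold cross-validation with scikit-learn, using the random forest mentioned in the question; the synthetic data, fold count, and model settings are illustrative assumptions, not part of the original answer:

```python
# Minimal sketch: stratified k-fold cross-validation with scikit-learn.
# X, y stand in for your ~1,500-sample feature matrix and class labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Each sample is used for validation exactly once and for training k-1 times.
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```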

Another potential downside of k-fold cross-validation is that a single extreme outlier can skew the results. For example, if you have one extreme outlier that can heavily bias your classifier, then in a 10-fold cross-validation it will end up in 9 of the 10 training partitions (though for random forests, I don't think you would have that problem).

Random subsampling (e.g., bootstrap sampling) is preferable when you are either undersampled or in the situation above, where you don't want each observation to appear in k-1 training folds.
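A minimal sketch of repeated random sub-sampling (Monte Carlo cross-validation) with the questioner's 2/3 train / 1/3 validation split; the use of scikit-learn's ShuffleSplit, the number of repetitions, and the logistic regression settings are my own assumptions:

```python
# Minimal sketch: repeated random sub-sampling with a 2/3 train / 1/3
# validation split, repeated enough times for a stable average score.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)

cv = ShuffleSplit(n_splits=50, test_size=1/3, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Unlike k-fold, a given sample may appear in many validation sets or in none.
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```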

OTHER TIPS

I assume you want to use 3-fold cross-validation because you know something about your data (would k = 10 cause overfitting? I'm curious about your reasoning). I am not sure that you do know this; if not, you can simply use a larger k.

If you still think that you cannot use standard k-fold cross-validation, then you could modify the algorithm a bit: say, split the data into 30 folds and each time use 20 for training and 10 for evaluation (then shift by one fold, so the first fold and the last nine become the evaluation set and the rest the training set, and so on). This way you can still use all of your data.
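A minimal sketch of that shifted-window scheme, assuming the folds are built from a random permutation of the indices and the 10 evaluation folds wrap around at the end; the helper name and parameters are hypothetical:

```python
# Minimal sketch: split the data into 30 contiguous folds, use 10 consecutive
# folds (wrapping around) for evaluation and the remaining 20 for training,
# then shift the evaluation window by one fold for the next split.
import numpy as np

def shifted_window_splits(n_samples, n_folds=30, n_eval_folds=10, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for start in range(n_folds):
        eval_ids = [(start + j) % n_folds for j in range(n_eval_folds)]
        test_idx = np.concatenate([folds[i] for i in eval_ids])
        train_idx = np.concatenate(
            [folds[i] for i in range(n_folds) if i not in eval_ids]
        )
        yield train_idx, test_idx

# Example: each sample lands in the evaluation set of exactly 10 of 30 splits.
for train_idx, test_idx in shifted_window_splits(1500):
    pass  # fit on train_idx, score on test_idx
```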

When I use k-fold cross-validation, I usually run the process multiple times with different randomisations to make sure I have sufficient data; if you don't, you will see different performance depending on the randomisation. In such cases I would suggest repeated sampling; the trick then is to do it often enough.
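A minimal sketch of that stability check, assuming scikit-learn: run 10-fold cross-validation under a few different shuffles and compare the averaged scores (the seeds, model, and synthetic data are illustrative only):

```python
# Minimal sketch: repeat 10-fold CV with different shuffles and see how much
# the fold-averaged score moves between randomisations.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000)

for seed in range(5):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(f"seed {seed}: mean accuracy {scores.mean():.3f}")
```

If the means vary a lot from seed to seed, the estimate is not yet stable and more repetitions (or more data) are needed.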

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange