Question

As seen in the accepted answer to "variance of k-fold cross validation", the simulation shows that k-fold CV yields roughly the same test error rate for different values of k when n = 200. Does this mean that a holdout validation set is likely to be as good as k-fold validation (assuming I have abundant data to make up for the high bias of the holdout approach)?

Apart from high bias, the problem with the holdout-set validation approach, as described in the ISL book, is that the test error rate is sensitive to the random split of the data into training and validation sets. My intuition is that with very high n (and well-spread-out data), the problem caused by random splitting is less likely to occur.
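
A quick simulation of this intuition (a minimal sketch; the synthetic data, the 50/50 split, and the logistic-regression model are my own assumptions, not from the linked answer): repeat the random holdout split many times and watch the spread of the test-error estimate shrink as n grows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

def holdout_error_spread(n, n_repeats=100):
    # Hypothetical binary classification problem with a fixed dataset of size n.
    X = rng.normal(size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n) > 0).astype(int)
    errors = []
    for seed in range(n_repeats):
        # Only the random split varies between repeats; the data stays fixed.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=seed)
        model = LogisticRegression().fit(X_tr, y_tr)
        errors.append(1 - model.score(X_te, y_te))
    # Spread of the holdout error estimate across random splits.
    return np.std(errors)

for n in [200, 2000, 20000]:
    print(f"n={n:>6}: std of holdout error across splits = "
          f"{holdout_error_spread(n):.4f}")
```

If the intuition holds, the standard deviation across splits should drop noticeably as n increases, meaning a single random split gives an increasingly stable estimate.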


Solution

Yes, you are right: when the number of observations is very large, k-fold cross-validation (CV) is less useful. Let's look at why this is so:

1) A very high number of observations implies a long training and validation time for the model. Training and validating on a large dataset is already expensive, and k-fold CV demands that it be done k times. This is a huge burden on resources, which is why k-fold CV is generally not used in the deep learning regime: the data needed to train good neural networks is very large compared to traditional ML algorithms (see the timing sketch after this list).

2) The higher the number of observations, the more data points end up in the validation set. This makes it less likely that the sampled points fail to represent the original distribution: the more data we sample, the better we approximate that distribution, so a single holdout estimate becomes correspondingly more reliable.
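
A minimal timing sketch of point 1 (the synthetic dataset, the random-forest model, and k = 10 are illustrative assumptions, not from the question): k-fold CV refits the model k times, so its wall-clock cost grows roughly linearly in k compared with a single holdout fit.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.RandomState(0)
X = rng.normal(size=(20000, 20))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Holdout: a single fit plus a single evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
t0 = time.perf_counter()
model.fit(X_tr, y_tr)
holdout_err = 1 - model.score(X_te, y_te)
t_holdout = time.perf_counter() - t0

# 10-fold CV: ten fits plus ten evaluations on the same data.
t0 = time.perf_counter()
cv_err = 1 - cross_val_score(model, X, y, cv=10).mean()
t_cv = time.perf_counter() - t0

print(f"holdout:    error={holdout_err:.4f}, time={t_holdout:.2f}s")
print(f"10-fold CV: error={cv_err:.4f}, time={t_cv:.2f}s")
```

With abundant data the two error estimates should be close, while the CV run takes roughly k times as long, which is the trade-off the answer describes.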

For these reasons, k-fold CV is inefficient when the number of observations is very high, and a single holdout validation set will do the job.
