Problem

The data set looks like this:

  • 25000 observations
  • up to 15 predictors of different types: numeric, multi-class categorical, binary
  • target variable is binary

Which cross-validation method is typical for this type of problem?

By default I'm using k-fold. How many folds are enough in this case? (One of the models I use is a random forest, which is time-consuming...)

Solution

You will get the best results if you take care to build the folds so that each variable (and most importantly the target variable) is approximately identically distributed in each fold. Applied to the target variable, this is called stratified k-fold. One approach is to cluster the inputs and make sure each fold contains a number of instances from each cluster proportional to the cluster's size.
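For illustration, a minimal sketch of stratified k-fold with scikit-learn's StratifiedKFold; the synthetic dataset and the 80/20 class weights below are placeholders standing in for the data described in the question, not the asker's actual data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in: 25,000 rows, 15 features, imbalanced binary target.
X, y = make_classification(n_samples=25_000, n_features=15,
                           weights=[0.8, 0.2], random_state=0)

# Each fold preserves the overall class proportions of y.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))

print(f"mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```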

Other tips

I think in your case a 10-fold CV will be O.K.

I think it is more important to randomize the cross-validation process than to select the ideal value for k.

So repeat the CV process several times with different random fold assignments, and compute the variance of your classification results to determine whether the results are reliable or not.
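A sketch of that idea using scikit-learn's RepeatedStratifiedKFold, which reshuffles the folds on every repetition; the dataset and model settings here are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2_000, n_features=15, random_state=0)

# 10-fold CV repeated 5 times, each repetition with a different shuffle.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

# A small variance across the 50 scores suggests the estimate is stable.
print(f"mean: {scores.mean():.3f}, variance: {scores.var():.5f}")
```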

I have to agree that k-fold should do "just" fine. However, there is a nice article about the "Bootstrap .632+" method (basically a smoothed cross-validation) that is supposed to be superior (although, as far as I can tell, they did the comparisons on non-binary data).

Maybe you want to check out this article here: http://www.jstor.org/stable/2965703
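To give a feel for the idea, here is a minimal sketch of the plain .632 bootstrap estimator; the article's .632+ variant adds a correction based on the relative overfitting rate, which is omitted here, and the dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
n = len(X)

# Out-of-bag error: train on a bootstrap sample, test on the left-out rows.
oob_errors = []
for _ in range(50):
    idx = rng.integers(0, n, n)            # sample with replacement
    oob = np.setdiff1d(np.arange(n), idx)  # rows not drawn this round
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[idx], y[idx])
    oob_errors.append(1 - clf.score(X[oob], y[oob]))

# Apparent (resubstitution) error on the full data.
clf_full = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
apparent_err = 1 - clf_full.score(X, y)

# Efron's .632 rule: weighted blend of optimistic and pessimistic errors.
err_632 = 0.368 * apparent_err + 0.632 * np.mean(oob_errors)
print(f".632 bootstrap error estimate: {err_632:.3f}")
```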

K-fold should do just fine for a binary classification problem. Depending on how long it takes to train your model and predict the outcome, I would use 10-20 folds.

However, sometimes a single fold takes several minutes; in that case I use 3-5 folds, but never fewer than 3. Hope it helps.
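If training time is the constraint, a rough way to pick k is to time a single fold and extrapolate; a sketch, where the model and data sizes are placeholders:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=25_000, n_features=15, random_state=0)

k = 10
n_train = len(X) * (k - 1) // k  # training-set size of one fold in k-fold CV

clf = RandomForestClassifier(random_state=0)
start = time.perf_counter()
clf.fit(X[:n_train], y[:n_train])
elapsed = time.perf_counter() - start

print(f"one fold: {elapsed:.1f}s, estimated {k}-fold CV total: {elapsed * k:.1f}s")
```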

To be honest, binary classification is the easiest case compared to multi-class classification, where it is easier to assign a sample to the wrong class by mistake. So with a multi-class dataset you need a good distribution of the classes across folds; the expectation is that more samples per fold give better insight, i.e., the number of folds should be smaller. In the binary case, if your class distribution is reasonably balanced you can easily go for CV = 10 with 25k observations; if the class distribution is skewed, you are better off with fewer folds.

So, in a nutshell, for a binary target the right number of folds depends mostly on your class distribution and not so much on the number of observations.
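To make the skew argument concrete, here is back-of-the-envelope arithmetic assuming a hypothetical 5% positive rate (the actual class balance is not given in the question); with a large k, each validation fold may hold too few minority examples for a stable estimate:

```python
# 25,000 observations, hypothetical 5% positive class (assumption, not from the question).
n_obs, pos_rate = 25_000, 0.05
for k in (3, 5, 10, 20):
    pos_per_fold = n_obs * pos_rate / k
    print(f"k={k:2d}: ~{pos_per_fold:.0f} positives per validation fold")
```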

Unless the label distribution is balanced, stratified sampling of folds will give you a better estimate of performance than random sampling.

Also, try to avoid having correlated samples end up in different folds. Otherwise your models are likely overfitted and the error is underestimated. For example, if your data contains temporal correlation, always split by time.
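In scikit-learn terms, GroupKFold keeps samples that share a group label in the same fold, and TimeSeriesSplit always validates on data later than the training data; which one applies depends on the data, which the question doesn't specify. A sketch with toy data:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.random.default_rng(0).integers(0, 2, 20)
groups = np.repeat(np.arange(5), 4)  # e.g. 4 correlated samples per subject

# GroupKFold: all samples of a group land in the same fold.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])

# TimeSeriesSplit: each split trains on the past, validates on the future.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < test_idx.min()
```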

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange