Question

Between cross-validation runs of an XGBoost classification model, I get different validation scores. This is normal: the train/validation split and the model's random state differ each time.

flds = self.gsk.Splits(X, cv_folds=cv_folds)
cvresult = xgb.cv(xgb_param, xgtrain,
                  num_boost_round=xgb_param['n_estimators'],
                  nfold=cv_folds, folds=flds, metrics='auc',
                  early_stopping_rounds=50, verbose_eval=True)

self.model.set_params(n_estimators=cvresult.shape[0])

To select the parameters, I run this CV multiple times and average the results in order to attenuate those differences.

Once my model parameters have been "found", what is the correct way to train the final model, given that it has some inner random state?

Do I:

  • train on the full train set and hope for the best?
  • keep the model with the best validation score in my CV loop (I am concerned this will overfit)?
  • bag all of them?
  • bag only the good ones?

Solution

Since you want your model to generalize, you should include all of your data when building the final model. You are correct that keeping the model with the best validation score from the CV loop would overfit. The inner random states help your model generalize (averaging over them reduces variance), and since you have already tuned your hyper-parameters with CV, you can simply apply those parameters when training the final model on the full training set.

As for feature selection, perform it inside each cross-validation fold, using only that fold's training data, so the held-out data never influences which features are kept. This prevents biasing the model. If you were to select your features on the same data that you then use to cross-validate, you would likely overestimate your accuracy.
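A minimal sketch of fold-safe feature selection, using a scikit-learn `Pipeline` so the selector is re-fit on each training fold only (the selector and classifier choices here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data with many uninformative features.
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),   # fit on the training fold only
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score re-runs the selection step per fold, so the
# held-out fold never influences which features are kept.
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
```

Selecting the 10 features on all of `X` first and then cross-validating would leak information from the validation folds into the selection step, inflating the scores.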

Here are some other great posts that help: https://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation

https://stats.stackexchange.com/questions/27750/feature-selection-and-cross-validation

Check out Dikran Marsupial's answers to both, they are really good.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange