Does it make sense to use train_test_split and cross-validation when using GridSearchCV to play with hyperparameters?

datascience.stackexchange https://datascience.stackexchange.com/questions/67740

08-12-2020

Question

I was wondering if my methodology makes sense. I am using GridSearchCV with cross-validation to train and tune model hyperparameters for a number of different model types (e.g. regression trees, ridge, elastic net). Before fitting the models I leave out 10% of the sample for model validation using train_test_split (see the screenshot below). I then select the models with the best parameters and use them to make predictions on the unseen validation set.

Am I missing something? I haven't seen anyone do this when evaluating model accuracy while tuning the model hyperparameters.

[Screenshot not included]
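Since the screenshot itself is not available, here is a minimal sketch of the workflow described above, assuming scikit-learn and a synthetic make_regression dataset standing in for the asker's data; the model list and parameter grids are illustrative, not taken from the original post.

    # Hypothetical reconstruction of the described workflow.
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.linear_model import Ridge, ElasticNet
    from sklearn.tree import DecisionTreeRegressor

    # Placeholder data; substitute the real X and y.
    X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)

    # Hold out 10% of the sample for final validation, tune on the remaining 90%.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.10, random_state=0
    )

    # One grid search (with 5-fold CV) per model type.
    searches = {
        "ridge": GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5),
        "elastic_net": GridSearchCV(
            ElasticNet(max_iter=10_000),
            {"alpha": [0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},
            cv=5,
        ),
        "tree": GridSearchCV(
            DecisionTreeRegressor(random_state=0),
            {"max_depth": [3, 5, None]},
            cv=5,
        ),
    }

    for name, search in searches.items():
        search.fit(X_train, y_train)  # cross-validated tuning on the 90%
        print(name, search.best_params_, search.best_score_)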


Solution

Yes, it makes sense. Your "validation" set is typically called the "test" set (at least in my experience). The cross-validation inside GridSearchCV creates the train/validation splits used to pick the hyperparameters.

The most complete process would be to then build a model one more time with the best hyperparameters on the 90%, and evaluate it on the 10%. This evaluation gives you a reliable estimate of the generalization error. It can be slightly higher than the error you got from the CV process for the same parameters, because the hyperparameter search can itself 'overfit' the CV folds a bit.
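For concreteness, continuing the hypothetical sketch from the question: with refit=True (GridSearchCV's default), each search has already refit its best configuration on the full 90% training split, so the remaining step is simply to score that model on the untouched 10%.

    # Final evaluation on the held-out 10% (R^2 for these regressors).
    for name, search in searches.items():
        test_score = search.best_estimator_.score(X_val, y_val)
        print(name, "CV score:", search.best_score_, "held-out score:", test_score)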

Some people skip this last step if they don't care about estimating the generalization error: whatever the true error is, the selected model is still the best you can do according to your tuning process. The upside of skipping it is, I suppose, 10% more training data.

Other tips

I'm wondering if this process could be even better:

  1. find the best params with GridSearchCV on the WHOLE dataset (no train_test_split)

  2. evaluate the model with the best params using k-fold cross-validation (multiple train/test splits), reporting the mean values of the metrics (see the sketch after this list)
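A rough sketch of what this two-step idea could look like, again assuming scikit-learn and reusing the hypothetical X and y from the first sketch; Ridge and its alpha grid are placeholders.

    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, cross_val_score

    # 1. Find the best params with GridSearchCV on the whole dataset.
    search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)
    search.fit(X, y)

    # 2. Re-estimate performance with k-fold CV for that configuration,
    #    averaging the per-fold metrics.
    best_model = Ridge(**search.best_params_)
    scores = cross_val_score(best_model, X, y, cv=5)
    print(search.best_params_, scores.mean(), scores.std())

Note that both steps see the same data, so unlike the 90/10 split above this leaves no observations completely unseen for the final estimate.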

Licensed under: CC-BY-SA with attribution