Question

I'd like to cite a paragraph from the book Hands On Machine Learning with Scikit Learn and TensorFlow by Aurelien Geron regarding evaluating on a final test set after hyperparameter tuning on the training set using k-fold cross validation:

"The performance will usually be slightly worse than what you measured using cross validation if you did a lot of hyperparameter tuning (because your system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on unknown datasets). It is not the case in this example, but when this happens you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data."

-Chapter 2: End-to-End Machine Learning Project

I am confused because he said that when the test score is WORSE than the cross-validation score (on the training set), you should not tweak the hyperparameters to make the test score better. But isn't that the purpose of having a final test set? What's the use of evaluating on a final test set if you can't tweak your hyperparameters when the test score is worse?


Solution

In "The Elements of Statistical Learning" by Hastie et al the authors describe two tasks regarding model performance measurement:

Model selection: estimating the performance of different models in order to choose the best one.

Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

Validation with CV (or a separate validation set) is used for model selection, and a test set is usually used for model assessment. If you did not do model assessment separately, you would most likely overestimate the performance of your model on unseen data.
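
For illustration, here is a minimal sketch (not from the book) of how the two tasks map onto scikit-learn; the estimator, parameter grid, and synthetic data are just placeholders:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for a real dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# Hold out a test set that is touched exactly once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model selection: k-fold cross-validation on the training set only.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
print("CV MSE (model selection):", -search.best_score_)

# Model assessment: a single evaluation of the chosen model on the untouched test set.
test_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
print("Test MSE (model assessment):", test_mse)
```

The test MSE from the last line is the generalization estimate; going back and changing the parameter grid because that number looks bad would effectively turn the test set into a second validation set.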

Other tips

So that we are on the same page, some prerequisites:

Suppose we had only two splits: train and test. When we tune our hyperparameters using the test split, we are trying to increase the accuracy (or some other metric). Although the model is not trained on the test set, we are making it perform well on the test set, so in a way the model gains information about the test set (it is almost like training on it). The model then ends up overfitting to both the train and the test set. That is why we split our data into three parts, i.e. train, validation, and test; see the sketch below.
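
A quick sketch of such a three-way split with scikit-learn (array shapes and split ratios are arbitrary, just to make the idea concrete):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data; shapes and values are only illustrative.
X = np.random.rand(1000, 5)
y = np.random.rand(1000)

# First carve off the test set (20%), then split the remainder
# into train (60%) and validation (20%).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

# Hyperparameters are tuned by evaluating on (X_val, y_val);
# (X_test, y_test) is looked at only once, for the final estimate.
```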

Now to answer your question:

I think the scenario the book's author is describing is one where the validation set does not fully represent the distribution the model will be evaluated on; hyperparameter tuning then overfits the model to the validation set, which leads to poorer performance on the test set. I think that if the validation set fully represented the entire distribution (or the test set, rather), test set accuracy would always increase as we perform hyperparameter tuning on the validation set.

Licensed under: CC-BY-SA with attribution