Question

I am training an XGBoost model on 60% of my data and using the remaining 40% for testing.

Within the 60%, I use 5-fold cross-validation to find the best number of trees. I find that the optimum is around 150 trees.

I evaluate my model on the 40% of the data I held out and I am happy with the performance, so now I can deploy it. Before that, however, I want to do a final retrain to make use of all my data. I wonder whether the number of trees obtained via cross-validation will still be optimal when I train on the full dataset. My intuition says I should use more than 150 trees: with more data, the model overfits less.

Are there any sound ways to choose the number of trees for the retrained model?

I can think of at least three:

  • Use the full data to cross-validate and obtain the optimal number of trees (this is certainly sound, but it is the slowest option).
  • Keep the same number of trees (this is the most conservative option; we might be underfitting the data).
  • Use a heuristic, such as final_trees = 150 * 100 / 60. I am very interested in a principled heuristic that would not underfit and that does not require running cross-validation again.
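The third option above can be sketched in a few lines. Note that this scaling rule is the asker's proposal, not an established result, and the 4/5 correction in the comment is an additional assumption:

```python
# Hypothetical heuristic: scale the CV-selected tree count by the inverse
# of the fraction of the full dataset that was used during tuning.
cv_trees = 150         # optimum found by 5-fold CV on the 60% split
train_fraction = 0.60  # share of the full dataset used during CV

final_trees = round(cv_trees / train_fraction)
print(final_trees)  # 250

# One could argue that inside 5-fold CV each model actually trains on
# only 4/5 of the 60%, i.e. 48% of the full data, which would suggest
# scaling by 0.48 instead:
alt_trees = round(cv_trees / (train_fraction * 4 / 5))
print(alt_trees)
```

Whether tree count should scale linearly with data size at all is itself an assumption; with a fixed learning rate there is no guarantee that it does.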

Have you heard of any heuristic like that?

Note: this is not specific to XGBoost; any model with a parameter that controls regularization faces the same issue.


Solution

Unpopular opinion: the second quickest way to overfit (after data leakage) is hyper-parameter optimization.

Why? You are assuming there will be no covariate shift, while in most cases you can bet on it occurring. Over-optimising on the training data (the data available to you) will therefore be your ruin.

The most reasonable assumption (one we have to verify actually holds) is that the 60% split is representative of the 40% split, and also, approximately, of future unseen data. Under that assumption, the 150 trees should already capture all of the necessary information.

Licensed under: CC-BY-SA with attribution