Question

If I used a subset of the entire available training data for model tuning and hyperparameter selection, should I fit the final model on that subset or on the entire available training data? For example, if I have 1M samples and I take 100K random samples as a test holdout and 200K random samples as a training dataset for model tuning, should the tuned hyperparameters be used to fit the final model on 1) the 200K training dataset, or 2) the 900K available samples (excluding the test holdout)? In other words, can the hyperparameters be generalized to the entire population?

I am assuming that both the holdout and training datasets are selected randomly and follow the class distribution in the original data.


Solution

The general machine learning process is this:

Split your data into two parts, training and test. So in your example I would take 100k for test and 900k for training (I don't know why you say to take only 200k in your question, but I digress). With the 900k training set we perform hyper-parameter tuning. This can be done by splitting training into training and validation, say 800k/100k, or better yet by using k-fold cross-validation.
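A minimal sketch of this split-and-tune step, assuming scikit-learn; the sample sizes, the `LogisticRegression` estimator, and the `C` grid are all illustrative stand-ins, scaled down from the question's 1M rows so the example runs quickly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Stand-in for the full dataset (1M rows in the question).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Hold out a test set (the question's 100k); keep the rest for training.
# stratify=y preserves the class distribution, as the question assumes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

# Tune hyper-parameters with 5-fold cross-validation on the training part only.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```

The test rows never enter `search.fit`, which is what keeps the later test-set evaluation honest.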

Once you have chosen the optimal hyper-parameters in this manner, you evaluate their performance on the test set. The whole point of this process is simply to estimate the algorithm's performance, and from that to select an algorithm. That is the only reason for a train/validation/test split. (As a note, this process can be improved further by using something called nested cross-validation, but I will not go into the details.)
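The nested cross-validation idea mentioned above can be sketched as follows, again assuming scikit-learn with an illustrative estimator and grid: the inner loop tunes the hyper-parameters, and the outer loop scores the whole tuning procedure on folds it never saw.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Inner loop: GridSearchCV picks C by 3-fold CV within each outer training fold.
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     {"C": [0.1, 1.0, 10.0]}, cv=3)

# Outer loop: 5-fold CV estimates the error of tuning-plus-fitting as a whole.
scores = cross_val_score(inner, X, y, cv=5)
print(scores.mean())
```

The mean of `scores` is a less biased performance estimate than scoring the tuned model on the same data that chose its hyper-parameters.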

After you have selected your algorithm and determined its performance (error rate), you take your whole dataset (1 million records) and perform hyper-parameter selection on that, either using a single split or k-fold cross-validation. You no longer need the test set because you have already determined the model's error rate in the previous step.

Once you have selected the best hyper-parameters in the previous step you apply them to the entire data set (the 1 million records) and build the model.
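These last two steps can be sketched as follows, assuming scikit-learn and the same illustrative estimator and grid as before: re-run the search on all the data, then fit the final model with the winning settings (with `refit=True`, the default, `GridSearchCV` does the final fit for you).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Stand-in for the whole dataset (all 1 million records in the question).
X_all, y_all = make_classification(n_samples=2000, n_features=10, random_state=0)

# Re-tune on everything; no test set is needed at this point.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_all, y_all)

# best_estimator_ is already refit on all of X_all with the best C.
final_model = search.best_estimator_
```

`final_model` is what you deploy; the error rate you report is the one measured on the held-out test set in the earlier step, not anything computed here.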

Licensed under: CC-BY-SA with attribution