Question

I've been trying to complete this regression task on Kaggle. As usual, they provide a train.csv file (with the response variable) and a test.csv file (without the response variable), used to train the model and to compute the predictions for submission, respectively.

I further split the train.csv file into a train_set and a test_set. I use the train_set to train a list of candidate models, which I then shortlist to a single model based on 10-fold cross-validation scores (RMSLE) and hyperparameter tuning. The best model turns out to be a Random Forest (with tuned hyperparameters) with an average RMSLE of 0.55. At this point I have NOT touched the test_set.
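A minimal sketch of that tuning step with scikit-learn, showing only the final Random Forest stage; the "target" column name, the parameter grid, and the 80/20 split are illustrative assumptions, not details from the question:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.model_selection import train_test_split, GridSearchCV

# RMSLE = sqrt(mean squared log error); lower is better, so the scorer is negated.
def rmsle(y_true, y_pred):
    return np.sqrt(mean_squared_log_error(y_true, np.clip(y_pred, 0, None)))

rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

# "target" is a placeholder column name; adjust to the competition's schema.
data = pd.read_csv("train.csv")
X, y = data.drop(columns=["target"]), data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 10-fold cross-validation over a small, illustrative hyperparameter grid, scored by RMSLE.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_depth": [None, 10, 20]},
    scoring=rmsle_scorer,
    cv=10,
)
search.fit(X_train, y_train)
print("Best CV RMSLE:", -search.best_score_)  # negate because the scorer was negated
```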

Next, I train that exact model on the train_set and evaluate it on the test_set (to check that I haven't overfit the tuned hyperparameters), which yields an RMSLE of 0.54. This is when I get suspicious, because my score on the test_set is slightly better than the average cross-validation score on the train_set (test_set results are supposed to be slightly worse, since the model has never seen that data, right?).
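Continuing the sketch above, the held-out evaluation would look roughly like this (again assuming the hypothetical names from the previous block):

```python
# Refit the best model on train_set only, then score it once on the held-out test_set.
best_model = search.best_estimator_
best_model.fit(X_train, y_train)
holdout_rmsle = rmsle(y_test, best_model.predict(X_test))
print("Held-out RMSLE:", holdout_rmsle)
```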

Finally, I submit predictions from the same model on the test.csv file (the one without the response variable). Kaggle then gives me an RMSLE of 0.77, which is considerably worse than both my cross-validation scores and my test_set score!
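For completeness, the submission step might look like the sketch below; "Id" and "target" are placeholder column names and the actual submission format depends on the competition:

```python
# Predict on Kaggle's test.csv (no response column) and write a submission file.
test_data = pd.read_csv("test.csv")
submission = pd.DataFrame({
    "Id": test_data["Id"],
    "target": best_model.predict(test_data.drop(columns=["Id"])),
})
submission.to_csv("submission.csv", index=False)
```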

I am very frustrated and confused as to why this would happen, since I believe I've taken every measure to guard against overfitting. Please give a detailed but simple explanation; I'm still a beginner, so I might not understand overly technical terms.


Solution

This "train split" you named train_set and test_set are not guarantee to be clean or even balanced.

When your test set scores better than your training set, it might mean you have data leakage (some examples in the test set are identical to examples in the training set), or simply that your test set happens to be slightly easier than the training set.
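A rough way to check the leakage case (not part of the original answer) is to count rows of the test_set that also appear verbatim in the train_set; this assumes the X_train / X_test frames from the sketch in the question and exact matches on all feature columns:

```python
# Rough leakage check: how many test_set rows also occur verbatim in train_set?
# merge() with no "on" argument joins on all shared columns.
overlap = X_test.merge(X_train.drop_duplicates(), how="inner")
print(f"{len(overlap)} of {len(X_test)} test_set rows also occur in train_set")
```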

OTHER TIPS

They want to test your ability to generalise.

The test (holdout/leaderboard) set will always have a somewhat different distribution (i.e. covariate shift), which is why leaderboard shakeups are common. People also often try to probe the test set to learn this distribution and adjust their model accordingly.
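One common diagnostic for such a shift (not mentioned in the answer, but widely used on Kaggle) is adversarial validation: train a classifier to distinguish train.csv rows from test.csv rows; a cross-validated AUC well above 0.5 means the two distributions differ noticeably. A rough sketch, assuming numeric features and a placeholder "target" column name (categorical columns would need encoding first):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Label rows by origin: 0 = train.csv, 1 = test.csv (response column dropped from train).
train_X = pd.read_csv("train.csv").drop(columns=["target"])  # "target" is a placeholder
test_X = pd.read_csv("test.csv")
combined = pd.concat([train_X, test_X], ignore_index=True)
origin = [0] * len(train_X) + [1] * len(test_X)

# If the classifier can separate the two sources (AUC well above 0.5),
# the train and test distributions differ, so a leaderboard gap is expected.
auc = cross_val_score(RandomForestClassifier(random_state=42), combined, origin,
                      scoring="roc_auc", cv=5).mean()
print("Adversarial validation AUC:", auc)
```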
