Is this over-fitting or something else?
11-12-2020
Problem
I recently put together an entry for the House Prices Kaggle competition for beginners. I decided to try my hand at understanding and using XGBoost.
I split Kaggle's 'training' data into my own 'training' and 'testing' sets. Then I fit and tuned my model on the new training data using KFold cross-validation with shuffling, scoring with scikit-learn's cross_val_score.
The average score on the training set with this cross-validation was 0.0168 (mean squared log error).
Next, with the fully tuned model, I checked its performance on the never-before-seen 'test' set (not the final test set for the Kaggle leaderboard). The score was identical after rounding.
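The validation workflow described above can be sketched roughly as follows. The data, model, and hyperparameters here are placeholders (synthetic data and scikit-learn's RandomForestRegressor standing in for the tuned XGBoost model), not the notebook's actual code:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for Kaggle's training data; targets made positive for MSLE.
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
y = np.abs(y) + 1

# 1. Split Kaggle's 'training' data into local train/test parts.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Score with shuffled KFold cross-validation on the training part.
model = RandomForestRegressor(n_estimators=100, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_msle = -cross_val_score(model, X_tr, y_tr, cv=cv,
                           scoring="neg_mean_squared_log_error").mean()

# 3. Check the tuned model on the never-before-seen local test part.
model.fit(X_tr, y_tr)
test_msle = mean_squared_log_error(y_te, model.predict(X_te))
print(cv_msle, test_msle)
```

If both numbers come out close, as in the question, it only shows the model generalizes to data drawn the same way as the local split, not necessarily to Kaggle's evaluation set.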
So, I patted myself on the back because I'd avoided over-fitting... or so I thought. When I made my submission to the competition, my score became 0.1359, which is a massive drop in performance. It amounts to being a solid 25 grand wrong on my house price predictions.
What could be causing this, if not overfitting?
Here is the link to my notebook, if it helps: https://www.kaggle.com/wesleyneill/house-prices-walk-through-with-xgboost
Solution
I'm not an avid Kaggler, but I do remember a case where the evaluation set for time-related data was randomly picked (which favored nearest-neighbor approaches, since exact duplicates could exist).
I'm not sure whether there are clues in the evaluation data this time (perhaps you can tell), but a possible overfit could be time-related.
If your local test set is just a random subsample of the train/test part, while the evaluation set is not randomly sampled but is instead, say, a holdout of the year 2011, then your model can learn rules specific to the time dimension that your local test set will never expose.
A possible way of tackling that would be to resample the test set accordingly.
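One way to sketch that resampling idea: hold out the most recent year as the local test set instead of a random sample, so the local split mimics a potentially time-skewed evaluation set. The frame below is hypothetical (a `YrSold` column as in the House Prices data), not the competition's actual split rule:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a sale-year column, as in the House Prices dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "YrSold": rng.choice([2006, 2007, 2008, 2009, 2010], size=200),
    "SalePrice": rng.uniform(50_000, 400_000, size=200),
})

# Time-based holdout: validate on the latest year rather than a random subsample.
cutoff = df["YrSold"].max()
train = df[df["YrSold"] < cutoff]
test = df[df["YrSold"] == cutoff]
```

If the local test score degrades under this split but not under a random one, that points to time-specific rules in the model.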
Other tips
You've followed the right process well. (Well, it's possible there's an error somewhere, like not randomly sampling the test set.)
I think the issue is simply that you have nevertheless overfit. The Kaggle held-out test set may, by chance, not be that similar to the provided training data. There's not a lot you can do except favor low-variance models over low-bias models in your selection process.
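One concrete way to act on that advice is to rank candidate models not by mean CV score alone but by a pessimistic score that penalizes fold-to-fold variance. This is an illustrative sketch with made-up per-fold MSLE scores, not a standard prescription:

```python
import numpy as np

# Hypothetical per-fold CV scores (MSLE, lower is better) for two tuned candidates.
scores_a = np.array([0.015, 0.016, 0.017, 0.018, 0.016])  # steady across folds
scores_b = np.array([0.010, 0.013, 0.024, 0.011, 0.022])  # lower best fold, higher variance

def pessimistic(scores):
    # Penalize instability: mean score plus one standard deviation.
    return scores.mean() + scores.std()

# The steadier model wins even though its best single fold is worse.
best = "a" if pessimistic(scores_a) < pessimistic(scores_b) else "b"
```

The exact penalty (one standard deviation here) is a judgment call; the point is to stop rewarding models whose good mean score hides erratic fold-to-fold behavior.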