Question

I recently put together an entry for the House Prices Kaggle competition for beginners. I decided to try my hand at understanding and using XGBoost.

I split Kaggle's 'training' data into my own 'training' and 'testing' sets. Then I fit and tuned my model on the new training data using KFold cross-validation, scoring it with scikit-learn's cross_val_score and a shuffled KFold.

The average score on the training set with this cross-validation was 0.0168 (mean squared log error).
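For context, a minimal sketch of that setup might look like the following; the file path, numeric-only feature handling, and hyperparameter values are illustrative assumptions, not taken from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from xgboost import XGBRegressor

# Load Kaggle's provided training file (path and feature handling are assumed).
data = pd.read_csv("train.csv")
X = data.select_dtypes(include=[np.number]).drop(columns=["SalePrice"])
y = data["SalePrice"]

# Carve a local hold-out 'test' set out of Kaggle's training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Score the model with shuffled KFold CV on the new training portion.
model = XGBRegressor(n_estimators=500, learning_rate=0.05)  # illustrative settings
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = -cross_val_score(
    model, X_train, y_train, cv=kf, scoring="neg_mean_squared_log_error"
)
print(scores.mean())  # the question reports roughly 0.0168 at this step
```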

Next, with the fully tuned model, I checked its performance on the never-before-seen 'test' set (not the final test set for the Kaggle leaderboard). The score was identical after rounding.
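Continuing the sketch above, that hold-out check could be as simple as:

```python
from sklearn.metrics import mean_squared_log_error

# Fit the tuned model on the full local training portion, then score the
# never-before-seen local hold-out set with the same metric.
model.fit(X_train, y_train)
holdout_msle = mean_squared_log_error(y_test, model.predict(X_test))
print(holdout_msle)  # the question reports this rounds to the CV score above
```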

So, I patted myself on the back because I'd avoided over-fitting... or so I thought. When I made my submission to the competition, my score became 0.1359, a massive drop in performance. It amounts to being a solid 25 grand wrong on my house price predictions.

What could be causing this, if not overfitting?

Here is the link to my notebook, if it helps: https://www.kaggle.com/wesleyneill/house-prices-walk-through-with-xgboost


Solution

I'm not an avid Kaggler, but I do remember a case where the evaluation split of time-related data was randomly sampled (which favored nearest-neighbor approaches, since exact duplicates could exist across the splits).

I'm not sure whether there are clues about how the evaluation data was sampled this time (perhaps you can tell). But a possible source of overfitting could be time-related.

If your local test set is just a random subsample of the provided train/test data, while the leaderboard evaluation set is not randomly sampled but is instead, for instance, a holdout of the year 2011, you can still learn rules specific to the time dimension and never see them fail on your local test set.

A possible way of tackling that would be to resample your local test set accordingly, as sketched below.
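A hypothetical version of that resampling: if the leaderboard holdout were the most recent sale years, mirror that locally instead of splitting at random. 'YrSold' is a real column in this competition's data, but the cut-off year here is purely illustrative:

```python
import pandas as pd

# Hypothetical time-based local split: hold out the most recent sale years
# to mimic a time-based leaderboard evaluation. The 2010 cut-off is illustrative.
data = pd.read_csv("train.csv")
train_part = data[data["YrSold"] < 2010]
test_part = data[data["YrSold"] >= 2010]
```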

Other tips

You've followed the right process. (Though it's possible there's an error somewhere, such as not sampling the test set randomly.)

I think the issue is simply that you have nevertheless overfit. The Kaggle held-out test set may, by chance, not be much like the provided training data. There's not a lot you can do except favor low-variance models over low-bias models in your selection process; see the sketch below.
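For example, one way to push XGBoost toward lower variance is to tighten its regularization knobs; the parameter values here are arbitrary starting points, not tuned recommendations:

```python
from xgboost import XGBRegressor

# Illustrative settings that trade variance for bias in XGBoost;
# the exact values are assumptions, not recommendations from the answer.
low_variance_model = XGBRegressor(
    max_depth=3,           # shallower trees generalize more
    min_child_weight=5,    # require more samples behind each split
    subsample=0.8,         # row subsampling per tree
    colsample_bytree=0.8,  # feature subsampling per tree
    reg_lambda=2.0,        # stronger L2 regularization
    learning_rate=0.05,
    n_estimators=500,
)
```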

License: CC BY-SA with attribution
Not affiliated with datascience.stackexchange