Question

I got a cv log_loss of 0.3025410331400577 using 4-fold cross-validation, and my leaderboard score (computed on 30% of the test dataset) was 0.26514. I then did feature engineering and added some features to the model, which decreased my cv log_loss to 0.2946628055452142, but my leaderboard score increased to 0.30021.

With every other technique I tried, my cv log_loss decreased but my leaderboard loss increased.
I used an XGBClassifier model. I have also removed all correlated features (corr > 0.8).
Usually we judge whether a model generalises based on its cv score, but here the cv score is not reliable. What may be the reason for this?

And is it valid to judge that my model performs better when my cv score decreases?
If not, what other techniques can I use to judge my model?


Solution

I think there are a few things going on in this question, so I will take them one at a time. There are multiple reasons you could be facing these issues, so I am just going to give some possible causes that come to mind. Note that by "leaderboard" I assume something like a Kaggle competition, where data is held back for use as a blind test. Overall, I think more information is needed to properly troubleshoot, which I will explain as I go.

I got a cv log_loss of 0.3025410331400577 using 4-fold cross-validation, and my leaderboard score (computed on 30% of the test dataset) was 0.26514. I then did feature engineering and added some features to the model, which decreased my cv log_loss to 0.2946628055452142, but my leaderboard score increased to 0.30021.

Your use of cross-validation (cv) is good; however, 4-fold seems a little low to me. The standard would generally be 5- or 10-fold; see Cross Validation for a nice discussion of cv and some advantages of 10-fold. My thought is that if you have a proportion of outliers/misclassified data in your training set, your low choice of 4 folds could mean those outliers are present in every training split, so your model is repeatedly trained on these misclassified cases. Perhaps test the effect of increasing the number of folds on your model's performance. Conversely, cv does depend on sample size, so if you are restricted on sample size this would necessitate reducing the number of folds or avoiding cv completely. The issue with not doing this is explained very nicely here, but in short, each of your training folds should have the same distribution as the test set; if you think this may not be the case, avoid cross-validation or drop the value of k. Here is a really nice discussion about k-fold cross-validation and over-fitting, but the bottom line is that it can happen. This is also without knowing which supervised machine learning technique you are using, which would also play a role, as some techniques (e.g. DNNs) work better with larger training sets.
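As a hedged sketch of how you might test this (assuming X and y are your feature matrix and binary target; the model settings here are placeholders, not your actual pipeline), you can compare log loss across different fold counts:

    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from xgboost import XGBClassifier

    model = XGBClassifier(eval_metric="logloss")

    for k in (4, 5, 10):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
        # sklearn negates log loss so "higher is better"; flip the sign back
        scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")
        print(f"{k}-fold log loss: {scores.mean():.4f} (+/- {scores.std():.4f})")

If the mean loss (and especially its spread across folds) shifts noticeably with k, that is a hint your folds are not all seeing the same distribution.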

With every other technique I tried, my cv log_loss decreased but my leaderboard loss increased. I used an XGBClassifier model. I have also removed all correlated features (corr > 0.8).

Your choice of XGBClassifier makes me think you are doing something akin to logistic regression (XGBoost's default classification objective is logistic), and as such your choice to remove one of each pair of highly correlated variables is, in general, a good idea, as it removes the redundant influence of those variables; here is a good discussion.
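One common way to do this (a sketch, assuming X is a pandas DataFrame; the 0.8 threshold matches yours) is to scan the upper triangle of the correlation matrix and drop one column from each highly correlated pair:

    import numpy as np

    corr = X.corr().abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
    X_reduced = X.drop(columns=to_drop)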

The fact that you are using cv to compare your approaches is good; this is essentially what it is for. But note that as you do this repeatedly, you risk overfitting to the training data, much as you would by tuning against a single training set. An important step I think you have missed is to split your data and create your own test set, which you do not touch until hyperparameter tuning is complete, to use as a blind test. This should give you results equivalent to the leaderboard test set, provided you have a big enough sample size, your train/test split has equivalent distributions, and the leaderboard's test set is, in fact, drawn from the same distribution as the data you are training on.
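A minimal sketch of that workflow (again assuming X and y, and with an arbitrary 80/20 split) would be:

    from sklearn.metrics import log_loss
    from sklearn.model_selection import cross_val_score, train_test_split
    from xgboost import XGBClassifier

    # split once and lock X_test away until tuning is finished
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    model = XGBClassifier(eval_metric="logloss")
    cv_loss = -cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_log_loss"
    ).mean()

    # only after all feature engineering and tuning is complete:
    model.fit(X_train, y_train)
    holdout_loss = log_loss(y_test, model.predict_proba(X_test))
    print(f"cv: {cv_loss:.4f}  holdout: {holdout_loss:.4f}")

A large gap between the two numbers is the same warning sign you are currently getting from the leaderboard, but now you can see it without burning submissions.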

And is it valid to judge that my model performs better when my cv score decreases?

Whether a better score means higher (e.g. accuracy, which we maximise) or lower (a loss function such as log loss, which we minimise): no, judging your model on the cv score alone is definitely not the way to go here, and I hope I have convinced you of some possible reasons why above.

EDIT Another possible reason, which I thought of after discovering the dataset is quite small, is that the issue could be caused by data leakage if an upsampling technique was used. Here is a very good discussion, but basically, if you upsample from the pool of training data before performing cv, then your model can learn traits of real training examples that were split into other folds. This would make your cv performance look far better than it really is. The way around this is to upsample within each cv fold.
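A sketch of the leakage-free version (this assumes the imbalanced-learn package and uses SMOTE purely as an example upsampler; its Pipeline re-fits the sampler on each training fold only, never on the held-out fold):

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    pipe = Pipeline([
        ("upsample", SMOTE(random_state=42)),  # applied to training folds only
        ("model", XGBClassifier(eval_metric="logloss")),
    ])

    # contrast with upsampling X, y *before* cross-validation, which leaks
    # synthetic copies of validation-fold rows into the training folds
    loss = -cross_val_score(pipe, X, y, cv=5, scoring="neg_log_loss").mean()
    print(f"leakage-free cv log loss: {loss:.4f}")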

To summarise, I think your question is really asking why a machine learning approach does not perform as well on test data as it does on training data, to which the answer is simple: it never will. I like the analogy of studying for an exam: you would do brilliantly if the exact questions you studied came up, but this rarely, if ever, happens! If something similar to what you studied came up, you may do well, but probably not quite as well.

Licensed under: CC-BY-SA with attribution