Question

I'm working on a binary classification task. The dataset is quite small: ~1800 rows and ~60 columns, with no duplicate rows. I am comparing several canonical classifiers: random forest, logistic regression, boosted trees, and SVC. I tune hyperparameters with cross-validation on 90% of the data (train), with the remaining 10% held out to measure generalization error (test). The dataset is slightly imbalanced (roughly a 1:3 class ratio), so I use stratified folds for all splits, and I use ROC-AUC as the CV metric.
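
My setup is essentially the following (a minimal sketch with synthetic placeholder data of the same shape and imbalance; the logistic regression and its grid stand in for each of the models below):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Placeholder data: ~1800 rows, ~60 columns, ~1:3 class ratio
X, y = make_classification(n_samples=1800, n_features=60,
                           weights=[0.75, 0.25], random_state=0)

# 90/10 stratified split: CV-based tuning on train, final score on test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

# Generalization estimate on the held-out 10%
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])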

I get the following ROC-AUC and accuracy scores:

Classifier              Train ROC-AUC  Train Acc.  Test ROC-AUC  Test Acc.
DummyClassifier         0.50000        0.69705     0.50000       0.69545
LogisticRegression      0.88459        0.78666     0.72559       0.69545
RandomForestClassifier  1.00000        0.99695     0.81748       0.80455
XGBClassifier           1.00000        0.99949     0.80617       0.79545
SVC                     0.89900        0.83248     0.73515       0.73182

There is always a significant gap between the train and test scores, so I am clearly overfitting. I suspect this is a consequence of the small number of rows, but I am not sure what to do about it. Should I restrict the CV grid search to hyperparameter ranges that enforce strong regularization?
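
For example, something like this for the random forest (a sketch; the specific values are illustrative, not tuned recommendations, and a similar small-C grid would apply to LogisticRegression):

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Grid deliberately biased toward stronger regularization:
# shallow trees, larger leaves, fewer features per split
param_grid = {
    "max_depth": [3, 5, 8],
    "min_samples_leaf": [5, 10, 20],
    "max_features": ["sqrt", 0.3],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(n_estimators=300, random_state=0),
                      param_grid, scoring="roc_auc", cv=cv)
# search.fit(X_train, y_train)  # same 90% stratified train split as above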

