Is not having overfitting more important than overall score (F1: 80-60-40% or 43-40-40)?

datascience.stackexchange https://datascience.stackexchange.com/questions/64533

20-10-2020

Question

I've been trying to model a dataset using various classifiers. The response is highly imbalanced (binary), and I have both numerical and categorical variables, so I applied SMOTENC and random oversampling on the Training set. In addition, I used a Validation set to tune the models' parameters with GridSearchCV(). As both precision and recall were important to me, I used F1 to find the best model.
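Roughly, the resampling and tuning setup looks like the sketch below; the synthetic data, the categorical column index, and the parameter grid are placeholders for illustration, not my actual code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTENC

# Placeholder imbalanced data: 4 numeric columns plus 1 categorical column (index 4).
X_num, y = make_classification(n_samples=1000, n_features=4,
                               weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
X = np.column_stack([X_num, rng.integers(0, 3, size=len(y))])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  stratify=y, random_state=0)

# Oversample the *training* portion only; the validation set stays untouched.
smote_nc = SMOTENC(categorical_features=[4], random_state=0)
X_res, y_res = smote_nc.fit_resample(X_train, y_train)

# Tune with F1 as the selection metric, using a predefined train/validation split
# so the grid search scores on the held-out validation samples.
X_all = np.vstack([X_res, X_val])
y_all = np.concatenate([y_res, y_val])
split = PredefinedSplit(np.r_[np.full(len(y_res), -1), np.zeros(len(y_val))])
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, scoring="f1", cv=split)
search.fit(X_all, y_all)
print(search.best_params_)
```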

I should note that I selected these three subsets by cluster analysis, extracting samples from each cluster with stratified train_test_split(), so I am more confident that the subsets are similar to each other.
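The per-cluster splitting idea is sketched below; the clustering here is a plain KMeans on synthetic data, standing in for my actual cluster analysis, and only one split level is shown (repeating it on the training portion would give the validation set):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder imbalanced data and clusters.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Split each cluster separately, stratifying on the target, then pool the indices.
train_idx, test_idx = [], []
for c in np.unique(clusters):
    idx = np.where(clusters == c)[0]
    tr, te = train_test_split(idx, test_size=0.3, stratify=y[idx], random_state=0)
    train_idx.extend(tr)
    test_idx.extend(te)

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```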

Due to the complex nature of Decision Tree, Random Forest, or boosting techniques, I usually get a very high fit (high F1 score) on the Training set, a relatively high one on the Validation set, but a moderate to low one on the Test set.

The general sign of overfitting is a large gap between the Training and Test scores (or between the Validation and Test scores in my problem), but I am confused about how to select the best model in the following cases:

Case A: the Training fit is very high, while the Validation and Test fits are low but close to each other.

Case B: the Training, Validation, and Test fits are similar, but much lower than in Case A.

                       F1 Score
   Model            Train     Val    Test
------------------------------------------
A: SVC               80.1    60.3    37.5
B: MLPClassifier     43.2    40.0    39.1

I know that Case A might be the best model; however, there is no guarantee that it produces similar results on new data. Which model would you pick with regard to overfitting? (Assume that precision and recall are similar for both models.)


Solution

The best model in your case is B not just because it scored higher on the test set, but because it showed very little sign of overfitting. By not overfitting as much and being more consistent in its scores (train/val/test), you know what you're getting from the model. If you try it tomorrow on a secondary test set, you'd expect similar results.

Model A, on the other hand, is very inconsistent. If you evaluated it on a secondary test set, you couldn't tell what to expect; it could score higher, it could score lower... Generally speaking, if you've overfit this much on the validation set, it's a good indication to start over (re-split the dataset into train/val/test randomly, preprocess, and fit the model again), this time trying fewer hyperparameter options.

From the numbers, even though model B is currently better than A, A might have the capacity to outperform B if trained properly. I'd suggest redoing model A's training from scratch, but with more regularization and less hyperparameter tuning.
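For illustration, that retraining could look something like the minimal sketch below: a fresh random split, placeholder data, and an arbitrary smaller grid biased toward stronger regularization (smaller C for an SVC). None of this is your actual pipeline; it only shows the shape of the suggestion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Placeholder imbalanced data standing in for the real dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=1)

# Fresh random (stratified) split, as suggested above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Smaller grid, biased toward stronger regularization (smaller C).
search = GridSearchCV(SVC(), {"C": [0.01, 0.1, 1]}, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))  # F1 on held-out data
```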

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange