Question

I am working on a binary classification problem with AUC as the metric. I did a random 70%/30% split into training and test sets. My first attempts using a random forest with default hyper-parameters gave me an AUC of 0.85 on the test set and 0.96 on the training set, so the model overfits. However, a score of 0.85 is good enough for my business. I also did 5-fold cross-validation with the same model and the same hyper-parameters, and the test-fold results were consistently between 0.84 and 0.86.
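
Roughly, the setup looks like the following minimal sketch (assuming scikit-learn; X and y are placeholders for my feature matrix and binary labels):

    # Sketch of the described setup: 70/30 split, random forest with default
    # hyper-parameters, AUC on train and test (X and y are placeholders).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42, stratify=y)

    model = RandomForestClassifier(random_state=42)  # default hyper-parameters
    model.fit(X_train, y_train)

    # AUC is computed from predicted probabilities of the positive class.
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"train AUC: {train_auc:.2f}, test AUC: {test_auc:.2f}")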

My question is: can I trust the 0.85 score and use this model in production?


Solution

Yes, if an AUC of 0.85 is good enough for your use case, this is a good enough model. The performance on the training set indicates how well your model knows the training set. We don't really care about that; it is simply what the model tries to optimize. The performance on the test set is an indication of how well your model generalizes, which is what we do care about, and your model reaches around 0.85 as an estimate of that generalization. Differences between training and test performance are the norm. In this case you might get better performance by adding stronger regularization, but if 0.85 is good enough, go for it!
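
For a random forest, stronger regularization usually means constraining how far the trees can grow. A minimal sketch, assuming scikit-learn (the specific values are only illustrative):

    # Illustrative only: limiting tree growth regularizes a random forest.
    from sklearn.ensemble import RandomForestClassifier

    regularized_model = RandomForestClassifier(
        n_estimators=200,
        max_depth=8,          # shallower trees memorize the training set less
        min_samples_leaf=20,  # each leaf must cover enough samples
        random_state=42,
    )
    # Fit as before and compare train vs. test AUC to see whether the gap shrinks.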

OTHER TIPS

My first attempts [...] gave me an AUC of 0.85 on the test set and 0.96 on the training set, so the model overfits.

This is not quite true.

Almost every estimator will have a better prediction score on the training data than on the test data. That does not mean every estimator overfits.

It is normal to have a better score on the training set, as the estimator is built on it: its parameters are fitted to that data. However, your estimator can fit your training data to a greater or lesser degree.

Let's take your random forest example. If the tree depth is too high, you fit the training data far too closely: you overfit. If the depth is not high enough, it will be hard to generalize to other data: you underfit. For example (a small sketch illustrating this follows the list below):

  1. Underfitting: e.g. 0.80 on train set & 0.78 on test set (the model cannot even fit the training data well)
  2. Possible good fitting: 0.96 on train set & 0.89 on test set
  3. Overfitting: 0.96 on train set & 0.75 on test set
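
To make this concrete, here is a small hypothetical sketch (assuming scikit-learn and the train/test split from the question) that sweeps the tree depth and prints the train/test AUC gap:

    # Hypothetical sweep over max_depth; X_train, X_test, y_train, y_test are
    # assumed to come from the 70/30 split described in the question.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    for depth in [2, 5, 10, None]:  # None lets the trees grow fully
        rf = RandomForestClassifier(max_depth=depth, random_state=42)
        rf.fit(X_train, y_train)
        train_auc = roc_auc_score(y_train, rf.predict_proba(X_train)[:, 1])
        test_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
        print(f"max_depth={depth}: train AUC {train_auc:.2f}, test AUC {test_auc:.2f}")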

As a good data scientist, you want your model to fit the data well enough to generalize, but not so closely that it overfits. To check how well your model generalizes, one uses cross-validation techniques. The value you get is pretty much what you will obtain on new data, ± the variance associated with that cross-validation.
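
A minimal cross-validation sketch, assuming scikit-learn (the mean is the generalization estimate, the standard deviation its variability):

    # 5-fold cross-validated AUC; mean +/- std summarizes the estimate.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    cv_auc = cross_val_score(RandomForestClassifier(random_state=42),
                             X, y, cv=5, scoring="roc_auc")
    print(f"CV AUC: {cv_auc.mean():.2f} +/- {cv_auc.std():.2f}")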

PS: Using cross-validation results too often to guide your choices means you are, in a way, learning the test data, since you pick whatever maximizes the test score. This can lead to a form of overfitting on future, unseen data.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange