Question

Let's make things simple. Imagine an underdetermined linear system with $N$ samples and $p$ features $(N<p)$. Let's say I found one of the many possible solutions of such a system and computed the test accuracy. My training error is zero in this setting, but my test error is not. Hence, overfitting. Despite that, assume that the test error rate is considered good from the perspective of field experts. Let's also assume that some checks guaranteed that this rather good error rate is not an artifact of the particular train/test split either.
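For concreteness, a minimal sketch of such a setup might look like the following (the dimensions, the sparse coefficient vector, and the noise level are illustrative assumptions, not part of the question):

```python
import numpy as np

# Illustrative underdetermined setup: N samples, p features, N < p.
rng = np.random.default_rng(0)
N, p = 30, 100
X_train = rng.normal(size=(N, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                              # assume a sparse true signal
y_train = X_train @ beta_true + 0.1 * rng.normal(size=N)

# np.linalg.lstsq returns the minimum-norm solution, i.e. one of the
# infinitely many coefficient vectors that fit the training data exactly.
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

X_test = rng.normal(size=(200, p))
y_test = X_test @ beta_true + 0.1 * rng.normal(size=200)

train_mse = np.mean((X_train @ beta_hat - y_train) ** 2)   # essentially zero
test_mse = np.mean((X_test @ beta_hat - y_test) ** 2)      # strictly positive
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.3f}")
```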

  1. Is it correct to argue that no effective learning has taken place here?
  2. If so, can one conclude that there exists a better model, with higher test accuracy, that does not overfit?

Solution

There's probably a bit of interpretation involved in the question; here is my take on it.

For the sake of clarity, let me start with my definition of the concept: overfitting is when a model picks up patterns which are present in the training data only by chance, i.e. the model assumes that these patterns are characteristic of the distribution even though they're not.

Is it correct to argue that no effective learning has taken place here?

No, it would be wrong to say that: the fact that a model overfits does not mean that no effective learning happened at all. In fact, it's often the case that a model successfully acquires the patterns it's supposed to capture (let's call this effective learning) but also acquires patterns that it shouldn't (overfitting). With any complex data it would even be very rare for no overfitting to happen at all, and very often it's hard to say exactly where the line between effective learning and overfitting lies.

However, people usually talk about overfitting when it is actually in excess, that is, when the model relies too much on the "chance patterns" and not enough on the patterns which actually characterize the distribution. Performance that is significantly lower on the test set than on the training set is a typical sign of such "excess overfit". So in this sense, (excessive) overfitting is when the "chance patterns" cause the model to be suboptimal. But even a suboptimal model may have learned some relevant patterns, as the short check below illustrates.
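As a rough check of this point, one can continue the sketch from the question (reusing `X_train`, `y_train`, `y_test`, and `test_mse` from that snippet) and compare the overfit interpolating model against a trivial baseline:

```python
# Reuses X_train, y_train, y_test and test_mse from the snippet in the
# question. A model that always predicts the training mean serves as a
# "no learning at all" reference point.
baseline_mse = np.mean((y_test - y_train.mean()) ** 2)
print(f"baseline test MSE: {baseline_mse:.3f}, model test MSE: {test_mse:.3f}")
# On data with a genuine signal, the overfit interpolator still tends to
# beat this baseline: suboptimal generalization, but not "no learning".
```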

If so, can one conclude that there exists a better model, with higher test accuracy, that does not overfit?

Not really: as said above, overfitting can make the model perform poorly, so naturally a non-overfit model usually performs better than an overfit one. However, there's no guarantee that better performance can be reached just by getting rid of the overfitting: as an extreme example, if the features are mostly random and/or uncorrelated with the response variable, then it's very likely that the model will overfit, but the performance would be terrible even without the overfitting (see the sketch below).
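A self-contained sketch of this extreme case (all names and sizes are illustrative): when features and target are independent, the interpolating fit still drives the training error to zero, yet no amount of de-overfitting could make the test error much better than chance.

```python
import numpy as np

# Extreme case: features carry no information about the target at all.
rng = np.random.default_rng(1)
N, p = 30, 100
X_train = rng.normal(size=(N, p))
y_train = rng.normal(size=N)                     # target independent of X
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

X_test = rng.normal(size=(200, p))
y_test = rng.normal(size=200)

print("train MSE:", np.mean((X_train @ beta_hat - y_train) ** 2))  # ~0
print("test MSE:", np.mean((X_test @ beta_hat - y_test) ** 2))     # no better than chance
# Heavy regularization would remove the overfit, but the test error would
# stay around Var(y): there is simply no real pattern to learn.
```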

In the case described by the OP, I would say that it's always worth trying to avoid (excess) overfitting: first, of course, because doing so can only improve performance, but also because, at a more general level, overfitting means that the model is not very reliable. If the model is later applied "in production" to a large set of instances which happen not to contain the particular "chance patterns" that were in the training data, it is going to go very wrong, and by then it will be too late to detect the problem.

Other tips

Is it correct to argue that no effective learning has taken place here?

On the contrary, if the test and train errors are sufficiently low for the field experts, it is safe to say that the model was able to learn from the data to some extent.

If so, can one conclude that there exists a better model, with higher test accuracy, that does not overfit?

Given the statement of your problem, I don't think there is enough information to conclude on this.

Your point might be better understood if you considered using a validation set or, better, cross-validation.

Your validation error would give you insight into how much overfitting takes place, and into whether the model effectively learns from your data even when the train and test sets follow somewhat different distributions; see the sketch below.
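For instance, a minimal sketch with scikit-learn (the synthetic data and the ridge penalty are assumptions for illustration, not part of the original suggestion):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))                   # N < p, as in the question
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=30)

# Out-of-fold errors estimate generalization; a large gap between these
# and the (near-zero) training error is a direct measure of overfitting.
val_mse = -cross_val_score(Ridge(alpha=1.0), X, y,
                           scoring="neg_mean_squared_error", cv=5)
print("validation MSE per fold:", val_mse.round(3))
```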

Licensed under: CC-BY-SA with attribution