Question

Logic often states that by overfitting a model, its capacity to generalize is limited, though this might only mean that overfitting stops a model from improving after a certain complexity. Does overfitting cause models to become worse regardless of the complexity of data, and if so, why is this the case?


Related: Followup to the question above, "When is a Model Underfitted?"


Solution

Overfitting is empirically bad. Suppose you have a data set which you split in two, test and training. An overfitted model is one that performs much worse on the test dataset than on the training dataset. It is often observed that such models also perform worse in general on additional (new) test datasets than models which are not overfitted.

One way to understand that intuitively is that a model may use some relevant parts of the data (signal) and some irrelevant parts (noise). An overfitted model uses more of the noise, which increases its performance in the case of known noise (training data) and decreases its performance in the case of novel noise (test data). The difference in performance between training and test data indicates how much noise the model picks up; and picking up noise directly translates into worse performance on test data (including future data).
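The signal-versus-noise intuition can be sketched with a toy regression task (the linear ground truth, noise scale, and polynomial degrees here are all hypothetical choices for illustration): a high-degree polynomial memorizes the training noise, so its training error is tiny while its test error is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D task: the true signal is linear; everything else is noise.
def make_data(n):
    x = np.linspace(0, 1, n)
    y = 2.0 * x + rng.normal(scale=0.3, size=n)  # signal + noise
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(20)  # same signal, fresh (novel) noise

def mse(x, y, coeffs):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A simple model captures mostly signal; a degree-15 polynomial
# also memorizes the training noise.
simple = np.polyfit(x_train, y_train, 1)
overfit = np.polyfit(x_train, y_train, 15)

print("simple  train/test:", mse(x_train, y_train, simple), mse(x_test, y_test, simple))
print("overfit train/test:", mse(x_train, y_train, overfit), mse(x_test, y_test, overfit))
```

The gap between the overfit model's training and test error is exactly the "noise picked up" described above.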

Summary: overfitting is bad by definition. It has little to do with either complexity or the ability to generalize; rather, it comes from mistaking noise for signal.

P.S. On the "ability to generalize" part of the question, it is very possible to have a model which has inherently limited ability to generalize due to the structure of the model (for example linear SVM, ...) but is still prone to overfitting. In a sense overfitting is just one way that generalization may fail.

OTHER TIPS

Overfitting, in a nutshell, means taking too much information from your data and/or prior knowledge into account and using it in a model. To make this more concrete, consider the following example: you're hired by some scientists to provide them with a model to predict the growth of some kind of plant. The scientists have given you information collected from their work with such plants throughout a whole year, and they will continuously give you information on the future development of their plantation.

So, you run through the data received and build a model from it. Now suppose that, in your model, you considered as many characteristics as possible in order to reproduce the exact behavior of the plants you saw in the initial dataset. As production continues, you'll always take those characteristics into account and produce very fine-grained results. However, if the plantation eventually suffers some seasonal change, your predictions will begin to fail (for example, predicting that growth will slow down when it will actually speed up, or the opposite).

Apart from being unable to detect such small variations, and usually classifying your entries incorrectly, the fine grain of the model, i.e., the great number of variables, may make the processing too costly. Now imagine that your data is already complex. Overfitting your model to the data will not only make the classification/evaluation very complex, but will most probably make the prediction err over the slightest variation in the input.

Edit: This might also be of some use, perhaps adding some dynamism to the above explanation :D

Roughly speaking, over-fitting typically occurs when the ratio

$$\frac{\text{model complexity}}{\text{size of the training data}}$$

is too high.

Think of over-fitting as a situation where your model learns the training data by heart instead of learning the big picture, which prevents it from being able to generalize to the test data: this happens when the model is too complex with respect to the size of the training data, that is to say, when the size of the training data is too small in comparison with the model complexity.

Examples:

  • if your data is in two dimensions, you have 10000 points in the training set, and the model is a line, you are likely to under-fit.
  • if your data is in two dimensions, you have 10 points in the training set, and the model is a 100-degree polynomial, you are likely to over-fit.
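These two examples can be sketched numerically (the quadratic ground truth and noise level are hypothetical; degree 9 stands in for the very-high-degree polynomial so that 10 points are interpolated exactly without extreme numerical trouble):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical quadratic ground truth with noise, 10 training points.
x_train = np.linspace(-1, 1, 10)
y_train = x_train**2 + rng.normal(scale=0.1, size=10)
x_test = np.linspace(-1, 1, 50)
y_test = x_test**2 + rng.normal(scale=0.1, size=50)

line = np.polyfit(x_train, y_train, 1)  # too simple: under-fits the curve
poly = np.polyfit(x_train, y_train, 9)  # interpolates all 10 points: over-fits

def mse(c, x, y):
    return float(np.mean((np.polyval(c, x) - y) ** 2))

print("line train/test:", mse(line, x_train, y_train), mse(line, x_test, y_test))
print("poly train/test:", mse(poly, x_train, y_train), mse(poly, x_test, y_test))
```

The over-fit polynomial's training error is essentially zero, yet its test error is much larger, which is the signature described above.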

[figure: a straight line under-fitting and a high-degree polynomial over-fitting the same data]

From a theoretical standpoint, the amount of data you need to properly train your model is a crucial yet far-from-answered question in machine learning. One approach to answering this question is the VC dimension. Another is the bias-variance tradeoff.

From an empirical standpoint, people typically plot the training error and the test error on the same plot and make sure that they don't reduce the training error at the expense of the test error:

[figure: training error and test error plotted against model complexity]
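The data behind such a plot can be computed by sweeping over model complexity (here, polynomial degree; the cubic ground truth and noise scale are hypothetical choices): training error keeps falling as complexity grows, while test error eventually stops improving.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical cubic ground truth with noise.
x_train = np.sort(rng.uniform(-1, 1, 30))
y_train = x_train**3 - x_train + rng.normal(scale=0.2, size=30)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = x_test**3 - x_test + rng.normal(scale=0.2, size=200)

train_err, test_err = [], []
for degree in range(10):  # the model-complexity axis of the plot
    c = np.polyfit(x_train, y_train, degree)
    train_err.append(float(np.mean((np.polyval(c, x_train) - y_train) ** 2)))
    test_err.append(float(np.mean((np.polyval(c, x_test) - y_test) ** 2)))

for d in range(10):
    print(d, round(train_err[d], 4), round(test_err[d], 4))
```

Plotting `train_err` and `test_err` against `degree` gives exactly the two curves described above.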

I would advise watching Coursera's Machine Learning course, section "10: Advice for applying Machine Learning".


No one seems to have posted the XKCD overfitting comic yet.

[image: the XKCD overfitting comic]

That's because of something called the bias-variance dilemma. An overfitted model has a more complex decision boundary because we allow the model more variance. The thing is, not only overly simple models but also overly complex models are likely to misclassify unseen data. Consequently, an over-fitted model is no better than an under-fitted one. That's why overfitting is bad and we need to fit the model somewhere in the middle.

What got me to understand the problem about overfitting was by imagining what the most overfit model possible would be. Essentially, it would be a simple look-up table.

You tell the model what attributes each piece of data has and it simply remembers it and does nothing more with it. If you give it a piece of data that it's seen before, it looks it up and simply regurgitates what you told it earlier. If you give it data it hasn't seen before, the outcome is unpredictable or random. But the point of machine learning isn't to tell you what happened, it's to understand the patterns and use those patterns to predict what's going on.
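A minimal sketch of such a look-up-table "model" (a hypothetical toy class, not a real library API): it memorizes every training pair, regurgitates memorized labels perfectly, and guesses at random on anything unseen.

```python
import random

class LookupTableModel:
    """The maximally overfit model: pure memorization, zero generalization."""

    def __init__(self, labels):
        self.table = {}
        self.labels = list(labels)

    def fit(self, xs, ys):
        for x, y in zip(xs, ys):
            self.table[x] = y  # memorize the pair, learn nothing general

    def predict(self, x):
        if x in self.table:
            return self.table[x]  # regurgitate what it was told earlier
        return random.choice(self.labels)  # unseen input: unpredictable

model = LookupTableModel(labels=["even", "odd"])
xs = [0, 1, 2, 3, 4, 5]
model.fit(xs, ["even", "odd", "even", "odd", "even", "odd"])

print([model.predict(x) for x in xs])  # perfect on training data
print(model.predict(100))              # not memorized: a random guess
```

The model is flawless on data it has seen and no better than chance on data it has not, because it never extracted the underlying pattern (here, parity).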

So think of a decision tree. If you keep growing your decision tree bigger and bigger, eventually you'll wind up with a tree in which every leaf node is based on exactly one data point. You've just found a backdoor way of creating a look-up table.

In order to generalize your results to figure out what might happen in the future, you must create a model that generalizes what's going on in your training set. Overfit models do a great job of describing the data you already have, but descriptive models are not necessarily predictive models.

The No Free Lunch Theorem says that no model can outperform any other model on the set of all possible instances. If you want to predict what will come next in the sequence of numbers "2, 4, 16, 32" you can't build a model more accurate than any other if you don't make the assumption that there's an underlying pattern. A model that's overfit isn't really evaluating the patterns - it's simply modeling what it knows is possible and giving you the observations. You get predictive power by assuming that there is some underlying function and that if you can determine what that function is, you can predict the outcome of events. But if there really is no pattern, then you're out of luck and all you can hope for is a look-up table to tell you what you know is possible.

You are erroneously conflating two different entities: (1) bias-variance and (2) model complexity.

(1) Over-fitting is bad in machine learning because it is impossible to collect a truly unbiased sample of the population for any data. The over-fitted model results in parameters that are biased toward the sample instead of properly estimating the parameters for the entire population. This means there will remain a difference between the estimated parameters $\hat{\phi}$ and the optimal parameters $\phi^{*}$, regardless of the number of training epochs $n$.

$|\phi^{*} - \hat{\phi}| \rightarrow e_{\phi} \mbox{ as }n\rightarrow \infty$, where $e_{\phi}$ is some bounding value

(2) Model complexity is, in simplistic terms, the number of parameters in $\phi$. If the model complexity is low, then there will remain a regression error regardless of the number of training epochs, even when $\hat{\phi}$ is approximately equal to $\phi^{*}$. The simplest example would be learning to fit a line ($y = mx + c$), where $\phi = \{m, c\}$, to data from a curve (a quadratic polynomial).

$E[|y-M(\hat{\phi})|] \rightarrow e_{M} \mbox{ as } n \rightarrow \infty$, where $e_{M}$ is some regression fit error bounding value
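The line-to-quadratic example can be sketched directly (noise-free data is a deliberate simplification here): no amount of data removes the line's residual error floor $e_{M}$, while a model of the right complexity drives the error to essentially zero.

```python
import numpy as np

# Noise-free quadratic data: even the best possible line leaves a
# residual error floor e_M, while a quadratic model does not.
x = np.linspace(-1, 1, 1000)
y = x**2

line = np.polyfit(x, y, 1)  # phi = {m, c}: too few parameters
quad = np.polyfit(x, y, 2)  # matches the true function class

err_line = float(np.mean((np.polyval(line, x) - y) ** 2))
err_quad = float(np.mean((np.polyval(quad, x) - y) ** 2))
print("line error:", err_line)
print("quad error:", err_quad)
```

`err_line` stays bounded away from zero no matter how many points are used, which is the $e_{M}$ in the formula above; `err_quad` is zero up to floating-point precision.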

Summary: Yes, both sample bias and model complexity contribute to the 'quality' of the learnt model, but they don't directly affect each other. If you have biased data, then regardless of having the correct number of parameters and infinite training, the final learnt model would have error. Similarly, if you had fewer than the required number of parameters, then regardless of perfectly unbiased sampling and infinite training, the final learnt model would have error.

There have been a lot of good explanations about overfitting. Here are my thoughts. Overfitting happens when your variance is too high and bias is too low.

Let's say you have training data, which you divide into N parts. Now, if you train a model on each part, you will have N models. Find the mean model, and then use the variance formula to compute how much each model varies from the mean. For overfitted models, this variance will be really high, because each model will have estimated parameters that are very specific to the small dataset we fed it. Similarly, if you take the mean model and compare it with the ideal model that would have given the best accuracy, it wouldn't be very different at all. This signifies low bias.
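This procedure can be sketched with polynomial models (the sinusoidal ground truth, noise level, and degrees are hypothetical choices): fit one model per data chunk and measure how much the fitted parameters vary across chunks.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical noisy sinusoidal data.
x = rng.uniform(-1, 1, 200)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=200)

def parameter_variance(degree, n_parts=10):
    """Split data into N parts, fit a model per part, and return the
    average variance of the fitted coefficients around their mean."""
    coeffs = []
    for xs, ys in zip(np.array_split(x, n_parts), np.array_split(y, n_parts)):
        coeffs.append(np.polyfit(xs, ys, degree))
    return float(np.mean(np.var(np.array(coeffs), axis=0)))

print("simple model (degree 1) variance:", parameter_variance(1))
print("complex model (degree 9) variance:", parameter_variance(9))
```

The complex model's parameters swing wildly from chunk to chunk because each fit chases the noise specific to its own small dataset, which is precisely the high variance described above.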

To find whether your model has overfitted or not, you could construct the plots mentioned in the previous posts.

Finally, to avoid overfitting you could regularize the model or use cross-validation.
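As one sketch of regularization (ridge regression in closed form on polynomial features; the data, degree, and penalty `lam` are hypothetical choices): the penalty shrinks the coefficients toward zero, trading a little training accuracy for a less wild fit.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical noisy sinusoidal data, deliberately scarce (15 points).
x_train = np.linspace(-1, 1, 15)
y_train = np.sin(2 * x_train) + rng.normal(scale=0.2, size=15)

def ridge_poly(x, y, degree, lam):
    """Closed-form ridge fit: (X^T X + lam * I)^-1 X^T y."""
    X = np.vander(x, degree + 1)
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)

def mse(w, x, y, degree):
    return float(np.mean((np.vander(x, degree + 1) @ w - y) ** 2))

degree = 9
unregularized = ridge_poly(x_train, y_train, degree, lam=0.0)
regularized = ridge_poly(x_train, y_train, degree, lam=1e-2)

print("lam=0    train error:", mse(unregularized, x_train, y_train, degree))
print("lam=1e-2 train error:", mse(regularized, x_train, y_train, degree))
print("coefficient norms:", np.linalg.norm(unregularized), np.linalg.norm(regularized))
```

The regularized fit accepts a slightly higher training error and much smaller coefficients, which is exactly the point: refusing to chase every wiggle of the training noise.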

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange