Question

I'm new to the data science field, and I'm a little confused about overfitting and underfitting.

Do overfitting and underfitting depend entirely on the amount of data, or on the behavior of the data?

Can anyone explain the terms overfitting and underfitting and how to deal with these problems?


Solution

Under/overfitting depends on two things: the amount of data in your dataset and the complexity of your model.

To identify when each of these is happening, you will have to split the data you have into two parts: training data and test data. You then train your model only on the training data and evaluate its performance (e.g. its accuracy, or any other metric you are interested in) on both the training data and the test data.
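
A minimal sketch of that workflow with scikit-learn, using the built-in iris dataset and a decision tree purely as placeholders for your own data and model:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold part of the data out as a test set the model never trains on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)  # fit only on the training split

# Evaluate the same metric on both splits and compare them.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

A large gap between the two numbers is the warning sign described next.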

If your model performs well on your training data (e.g. you get very good accuracy during training) but cannot make good predictions on your test data, then we say that the model is overfitting. What this means is that the model has memorized the training data instead of learning the patterns in it. As a result, it cannot generalize and make good predictions on data it hasn't seen before (e.g. the test data).

This can be fixed either by reducing the complexity of the model (e.g. if it is a neural network, reduce the number of layers) or by increasing the amount of data (e.g. by collecting more data or using data augmentation techniques).
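
As a rough illustration of both the symptom and the complexity fix, here is a sketch on synthetic data, where a degree-15 polynomial stands in for an overly complex model and degree 3 for a simpler one (the exact degrees are only illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Small noisy sample drawn from a sine curve (purely synthetic).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=40)
X_train, X_test, y_train, y_test = X[:20], X[20:], y[:20], y[20:]

# degree 15 = overly complex model, degree 3 = simpler model
for degree in (15, 3):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The complex model typically shows a very low training error and a much larger test error (overfitting), while the simpler one gives similar errors on both splits.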

If your model performs poorly on both the training and the test data, then we say it is underfitting. This means that the model is not complex enough to learn the patterns in the training data. This can be fixed by using a more complex model (i.e. a model with more parameters).
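
A rough sketch of that situation, again on synthetic data: logistic regression simply stands in for a model that is too simple for the pattern (two concentric circles cannot be separated by a straight line), and a random forest for a more complex one:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Two concentric circles: no straight line can separate the classes.
X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [("logistic regression (too simple)", LogisticRegression()),
          ("random forest (more complex)", RandomForestClassifier(random_state=0))]

for name, model in models:
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train {train_acc:.2f}, test {test_acc:.2f}")
```

Here the linear model typically scores near chance on both splits (underfitting), while the more flexible model does well on both.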

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange