Disparity between training and testing errors with deep learning: the bias-variance tradeoff and model selection

datascience.stackexchange https://datascience.stackexchange.com/questions/75003

Question

I am developing a convolutional neural network and have a dataset with 13,000 datapoints that is split 80%/10%/10% train/validation/test. In tuning the model architecture, I found the following, after averaging results over several runs with different random seeds:

3 conv layers: training MAE = 0.255, val MAE = 0.340
4 conv layers: training MAE = 0.232, val MAE = 0.337
5 conv layers: training MAE = 0.172, val MAE = 0.328.

Normally, I'd pick the model with the best validation MAE (the trends are the same for the testing MAE, for what it's worth). However, the architecture with the best validation MAE also has the largest difference between training and validation MAE. Why is what I'd normally think of as overfitting giving better results? Would you also go with 5 convolutional layers here, or are there concerns with a large difference in training and validation/testing performance?
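For concreteness, here is a minimal sketch of the seed-averaged comparison described above; `build_model` and `train_and_evaluate` are placeholders for the actual training code, not a real API:

```python
# Sketch: compare candidate depths by validation MAE averaged over random seeds.
# build_model and train_and_evaluate are hypothetical placeholders.
import numpy as np

def compare_architectures(depths=(3, 4, 5), seeds=(0, 1, 2)):
    results = {}
    for depth in depths:
        train_maes, val_maes = [], []
        for seed in seeds:
            model = build_model(n_conv_layers=depth, seed=seed)   # placeholder
            train_mae, val_mae = train_and_evaluate(model, seed)  # placeholder
            train_maes.append(train_mae)
            val_maes.append(val_mae)
        results[depth] = (np.mean(train_maes), np.mean(val_maes))
    # Normally I'd pick the depth with the lowest averaged validation MAE.
    best_depth = min(results, key=lambda d: results[d][1])
    return results, best_depth
```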

On what I imagine is a related note, I am familiar with the article "Reconciling modern machine-learning practice and the classical bias–variance trade-off" in PNAS, which has the thought-provoking image below. Is this something that's actually observed in practice -- that you can have minimal training error but good out-of-sample, generalizable performance, as shown in subpanel B?

[Figure: risk curves from the PNAS paper. Panel A shows the classical U-shaped bias–variance curve with its "sweet spot"; panel B shows the modern "double descent" curve.]


Solution

Your question is which model is better: one that seems more overfitted (a larger gap between the train and validation scores) but achieves better validation scores, or one with a smaller gap between train and validation but worse results overall. All of this assumes you have done a correct train/test split, there is no data leakage, and the distributions are the same in every split (this is important to check).
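As a quick sanity check on that last point, a sketch (assuming SciPy is available and that `y_train`, `y_val`, `y_test` are your target arrays) could compare the target distribution of each split:

```python
# Sketch: compare the target distribution of each split with summary stats
# and a two-sample Kolmogorov-Smirnov test.
from scipy import stats

def compare_splits(y_train, y_val, y_test):
    for name, y in [("val", y_val), ("test", y_test)]:
        ks = stats.ks_2samp(y_train, y)
        print(f"train vs {name}: mean {y_train.mean():.3f} vs {y.mean():.3f}, "
              f"KS statistic {ks.statistic:.3f}, p-value {ks.pvalue:.3f}")
```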

There was a discussion about this some time ago. The answer is somewhat subjective once the quantitative analysis has been done.

Normally there are the following trade-offs:

  • Complexity: Occam's razor, and complexity vs. interpretability. In your case, both models have almost the same complexity (it's not linear regression against deep learning, just a couple more layers) and the interpretability stays the same.

  • Generalization: you want your model to behave as well as possible in production, and a model that is overfitted to the training set is more likely to fail when the data distribution changes in production.

You only have three data points (three architectures), so it's hard to say which will be best. My suggestions would be:

  • Add some more layers (6, 7, 8) just to see when your validation/test performance starts to degrade (you can still overfit much more), then visualize the results, keeping both trade-offs defined above in mind when choosing the best architecture for your model (see the sketch after this list).

  • Investigate more hyperparameters (adding one more layer is a fairly coarse-grained change), such as learning rate, layer size, activation functions, and so on.

  • Consider using one of the well-known architectures for your problem; they are implemented in every framework and tested by many people, they are there because they seem to be the best at their task, so give them a go. A lot of electricity has already been wasted on deep learning hyperparameter tuning.
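As an illustration of the first suggestion, here is a rough sketch of a depth sweep, using Keras as an example framework; the input shape, filter counts, and training settings are placeholders to adapt to your data, and `x_train`, `y_train`, `x_val`, `y_val` stand for your 80%/10% splits:

```python
# Sketch: sweep over deeper CNNs and record the best train/validation MAE per depth.
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(n_conv_layers, input_shape=(64, 64, 1)):
    model = keras.Sequential([keras.Input(shape=input_shape)])
    for _ in range(n_conv_layers):
        model.add(layers.Conv2D(32, 3, padding="same", activation="relu"))
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dense(1))                       # single regression output
    model.compile(optimizer="adam", loss="mae", metrics=["mae"])
    return model

results = {}
for depth in (5, 6, 7, 8):
    model = build_cnn(depth)
    hist = model.fit(x_train, y_train,                   # 80% training split
                     validation_data=(x_val, y_val),     # 10% validation split
                     epochs=50, batch_size=32, verbose=0)
    results[depth] = (min(hist.history["mae"]),
                      min(hist.history["val_mae"]))
```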

Other tips

It is common in applied machine learning for the model with the lowest generalization error, as measured by its score on validation data, to also have the biggest gap from its score on the training data.

There is nothing inherently wrong with overfitting, it depends on the goal of the project. The typical goal of applied machine learning is high predictive ability on unseen data, aka low generalization error. It is okay if the model "memorizes" more of the training data if that helps the model improve generalization.

Given the trend that generalization error decreases as the number of layers increases, the performance of the model might improve further if the number of layers continues to increase.

I glanced over the paper; it seems very interesting. It would be truly fascinating to see if this phase transition from pattern recognition to interpolation of the data actually holds true as the authors claim, though I am a bit sceptical.

However, I think you are far away from interpolating your data. Your models are not very complex yet, so for your case we should be able to apply the classical principles of model selection.

If you look at the image you provided, figure A has an arrow that reads "sweet spot".

You can see how, before that point, the gap between the generalization error and the training error widens, but the generalization error nonetheless still decreases. You are around that spot.

Probably you are still before that spot, so I would recommend still increasing the complexity of your model and adding layers.
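A small sketch of that idea (matplotlib assumed; `results` is a hypothetical mapping from depth to averaged train/validation MAE, e.g. from a sweep like the one above):

```python
# Sketch: plot training and validation MAE against model depth to see roughly
# where the validation curve bottoms out ("sweet spot"), even if the
# train/validation gap keeps widening before that point.
import matplotlib.pyplot as plt

def plot_error_curves(results):
    depths = sorted(results)
    train = [results[d][0] for d in depths]
    val = [results[d][1] for d in depths]
    plt.plot(depths, train, marker="o", label="training MAE")
    plt.plot(depths, val, marker="o", label="validation MAE")
    plt.xlabel("number of conv layers (model complexity)")
    plt.ylabel("MAE")
    plt.legend()
    plt.show()
```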

Better to add some dropout and pooling layers once you notice true overfitting. Usually neural networks become more powerful the deeper they are, since depth allows them to learn richer representations of the data. Pooling helps keep the complexity of your model down, and therefore lets you spend that complexity on the depth of your model. Dropout helps with generalization because of its regularizing properties: it forces the network to rely on multiple neuron connections for a decision, which makes it less susceptible to being driven by a single pattern in your input data. You could interpret dropout as multiple neural networks fused into one, so it behaves similarly to an ensemble of neural networks.
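A minimal sketch of what adding pooling and dropout might look like, again with Keras as the example framework and with placeholder shapes and rates:

```python
# Sketch: CNN with pooling (to keep complexity down) and dropout (to regularize).
from tensorflow import keras
from tensorflow.keras import layers

def build_regularized_cnn(n_conv_layers=6, input_shape=(64, 64, 1),
                          dropout_rate=0.25):
    model = keras.Sequential([keras.Input(shape=input_shape)])
    for i in range(n_conv_layers):
        model.add(layers.Conv2D(32, 3, padding="same", activation="relu"))
        if i % 2 == 1:
            model.add(layers.MaxPooling2D(2))         # downsample every second block
            model.add(layers.Dropout(dropout_rate))   # regularize the conv features
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(dropout_rate))
    model.add(layers.Dense(1))
    model.compile(optimizer="adam", loss="mae", metrics=["mae"])
    return model
```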

Personally, I rely on the classical theory of model selection and it has served me well. Since your dataset is not that large, you might want to use something like five-fold cross-validation to be more confident in your results.
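A minimal five-fold cross-validation sketch with scikit-learn's KFold; `build_cnn` is the hypothetical builder from the earlier sketches, and `X`, `y` stand for your full feature and target arrays:

```python
# Sketch: estimate validation MAE for a given depth with 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate_depth(X, y, depth, n_splits=5, epochs=50):
    fold_maes = []
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kfold.split(X):
        model = build_cnn(depth)                       # fresh model per fold
        model.fit(X[train_idx], y[train_idx],
                  epochs=epochs, batch_size=32, verbose=0)
        _, mae = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        fold_maes.append(mae)
    return np.mean(fold_maes), np.std(fold_maes)
```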

Hope I could provide some insights.

Licensed under: CC-BY-SA with attribution