Question

I have a convolutional + LSTM model in Keras, similar to this (ref 1), that I am using for a Kaggle contest. The architecture is shown below. I trained it on my labeled set of 11000 samples (two classes, initial prevalence ~9:1, so I upsampled the 1's to roughly a 1:1 ratio) for 50 epochs with a 20% validation split. I was getting blatant overfitting for a while, but I thought I had it under control with noise and dropout layers.
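For context, the upsampling was along these lines (a minimal sketch with illustrative variable names, not the original code; X and y are the raw features and integer labels in {0, 1}):

import numpy as np

# Replicate the minority class (the 1's) until the two classes are roughly balanced.
pos_idx = np.where(y == 1)[0]   # minority class, ~10% of samples
neg_idx = np.where(y == 0)[0]   # majority class
extra = np.random.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
idx = np.random.permutation(np.concatenate([neg_idx, pos_idx, extra]))
X_train, y_train = X[idx], y[idx]   # roughly 1:1 after resampling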

The model looked like it was training wonderfully and, at the end, scored 91% on the entirety of the training set, but upon testing on the test data set the results were absolute garbage.

Loss by epoch

Validation accuracy by epoch

Notice: the validation accuracy is higher than the training accuracy. This is the opposite of "typical" overfitting.

My intuition is that, given the small-ish validation split, the model is still managing to fit the input set too strongly and is losing generalization. The other clue is that val_acc is greater than acc, which seems fishy. Is that the most likely scenario here?

If this is overfitting, would increasing the validation split mitigate it at all, or will I run into the same issue, since, on average, each sample would still see half of the total epochs?

The model:

Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
convolution1d_19 (Convolution1D) (None, None, 64)      8256        convolution1d_input_16[0][0]     
____________________________________________________________________________________________________
maxpooling1d_18 (MaxPooling1D)   (None, None, 64)      0           convolution1d_19[0][0]           
____________________________________________________________________________________________________
batchnormalization_8 (BatchNormal(None, None, 64)      128         maxpooling1d_18[0][0]            
____________________________________________________________________________________________________
gaussiannoise_5 (GaussianNoise)  (None, None, 64)      0           batchnormalization_8[0][0]       
____________________________________________________________________________________________________
lstm_16 (LSTM)                   (None, 64)            33024       gaussiannoise_5[0][0]            
____________________________________________________________________________________________________
dropout_9 (Dropout)              (None, 64)            0           lstm_16[0][0]                    
____________________________________________________________________________________________________
batchnormalization_9 (BatchNormal(None, 64)            128         dropout_9[0][0]                  
____________________________________________________________________________________________________
dense_23 (Dense)                 (None, 64)            4160        batchnormalization_9[0][0]       
____________________________________________________________________________________________________
dropout_10 (Dropout)             (None, 64)            0           dense_23[0][0]                   
____________________________________________________________________________________________________
dense_24 (Dense)                 (None, 2)             130         dropout_10[0][0]                 
====================================================================================================
Total params: 45826

Here is the call to fit the model (class weight is typically around 1:1 since I upsampled the input):

class_weight = {0: 1./(1 - ones_rate), 1: 1./ones_rate}  # automatically balance based on class occurrence
m2.fit(X_train, y_train, nb_epoch=50, batch_size=64, shuffle=True, class_weight=class_weight, validation_split=0.2)
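For reference, ones_rate is simply the fraction of positive samples in the training labels, so after upsampling the two weights come out close to 1:1. A hedged sketch, assuming y_train is one-hot encoded with class 1 in the second column:

import numpy as np

# Fraction of positive samples in the (upsampled) training labels; with a ~1:1
# ratio this gives class weights of roughly {0: 2.0, 1: 2.0}, i.e. effectively balanced.
ones_rate = np.mean(y_train[:, 1])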

Ref 1 (the example I based this on): https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/


Solution

I am not sure whether your validation set is balanced or not, but you have a severe data imbalance problem. If you sample equally and randomly from each class to train your network, and a percentage of what you sampled is then used to validate it, you are training and validating on a balanced data set. In testing, however, you used an imbalanced data set. This means that your validation and testing sets are not equivalent: in such a case you can get high validation accuracy and low testing accuracy. See this reference, which deals mainly with the data imbalance problem for DNNs; you can check how they sample for training, validation and testing: https://pdfs.semanticscholar.org/69a6/8f9cf874c69e2232f47808016c2736b90c35.pdf
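A hedged sketch of what this implies in practice: hold out a validation set that keeps the original ~9:1 ratio (like the test set) before any upsampling, and upsample only the training portion. The variable names (X_raw, y_raw) and the sklearn split are illustrative, not from the original post:

import numpy as np
from sklearn.model_selection import train_test_split

# Split BEFORE upsampling so the validation set keeps the natural ~9:1 ratio,
# matching the conditions the model will face at test time.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_raw, y_raw, test_size=0.2, stratify=y_raw, random_state=0)

# Upsample the minority class in the training split only (as in the earlier sketch),
# then validate on the untouched, imbalanced split, e.g.
# m2.fit(X_tr_upsampled, y_tr_upsampled, nb_epoch=50, batch_size=64,
#        validation_data=(X_val, y_val))
# (labels would need one-hot encoding to match the Dense(2) output layer).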

OTHER TIPS

If your training loss goes below your validation loss, you are overfitting, even if the validation loss is still dropping.

It is a sign that your network is learning patterns in the training set that do not apply to the validation set.
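One common way to act on that sign (a sketch, not something from the original answer) is to stop training once the validation loss stops improving:

from keras.callbacks import EarlyStopping

# Halt training a few epochs after val_loss stops improving, before the gap
# between training and validation loss widens further.
early_stop = EarlyStopping(monitor='val_loss', patience=3)

# m2.fit(..., callbacks=[early_stop])  # passed alongside the existing fit arguments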

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange