Question

I am wondering what could cause a huge jump in the training loss for only one epoch. I am getting output like this...

Epoch 1/10
2020-05-13 18:42:19.436235: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 30910 of 40000
2020-05-13 18:42:22.360274: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
200/200 [==============================] - 173s 863ms/step - loss: 0.1844 - val_loss: 0.4250
Epoch 2/10
200/200 [==============================] - 167s 833ms/step - loss: 80890.9766 - val_loss: 0.5157
Epoch 3/10
200/200 [==============================] - 166s 830ms/step - loss: 0.0549 - val_loss: 0.2966
Epoch 4/10
200/200 [==============================] - 170s 849ms/step - loss: 0.0488 - val_loss: 0.2708

This strikes me as very odd. It is a 3-layer LSTM network built with Keras.

I graphed the data after normalization, but nothing out of the ordinary appears.

Graph of dataset after normalization

Here is the code I used to create the network...

import tensorflow as tf

# create new, more complicated network
multi_step_model = tf.keras.models.Sequential()
multi_step_model.add(tf.keras.layers.LSTM(32,
                                          return_sequences=True,
                                          input_shape=[past_history, len(features_considered)]))
multi_step_model.add(tf.keras.layers.LSTM(16, activation='relu'))
# output layer size matches the number of predicted time steps
multi_step_model.add(tf.keras.layers.Dense(future_target))
multi_step_model.compile(optimizer=tf.keras.optimizers.RMSprop(clipvalue=1.0), loss='mae')

# run training
multi_step_history = multi_step_model.fit(train_data_multi, epochs=EPOCHS,
                                          steps_per_epoch=EVALUATION_INTERVAL,
                                          validation_data=val_data_multi,
                                          validation_steps=50)

Solution

I cannot say for sure, but this might simply be an error in the printout. The validation loss does not spike at all in epoch 2, so the training loss may not actually be what is printed.
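
One way to check whether a single batch really produces such a value (a minimal sketch; the batch_losses list and the LambdaCallback are my additions, not part of your code) is to record the loss after every batch and look for the spike there:

# Sketch: record the per-batch training loss so a single spiking batch can be spotted.
batch_losses = []
batch_loss_logger = tf.keras.callbacks.LambdaCallback(
    on_batch_end=lambda batch, logs: batch_losses.append(logs['loss']))

multi_step_history = multi_step_model.fit(train_data_multi, epochs=EPOCHS,
                                          steps_per_epoch=EVALUATION_INTERVAL,
                                          validation_data=val_data_multi,
                                          validation_steps=50,
                                          callbacks=[batch_loss_logger])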

I can only advise running the training several times to see whether this happens again. If it does, try adjusting the learning rate: there is a small chance that it is too high, causing the loss to diverge for one epoch.
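
If you want to try a lower learning rate, you can set it explicitly when creating the optimizer (a sketch; 1e-4 is just an example value, RMSprop defaults to 1e-3):

# Sketch: recompile with an explicitly lower learning rate
# (1e-4 is an arbitrary example; the RMSprop default is 1e-3).
multi_step_model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4, clipvalue=1.0),
    loss='mae')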

Based on your code, your optimizer clips every gradient component whose absolute value exceeds 1. If most of your gradient components are larger than that, clipping them individually changes the direction of the update, so the model can be pushed in the "wrong direction", which could explain the loss becoming so large for some epochs.
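
If the clipping turns out to be the cause, one alternative (a sketch, not taken from your code) is to clip by the gradient norm instead of per value, which rescales large gradients while preserving their direction:

# Sketch: clip by global gradient norm instead of per-component value,
# which scales oversized gradients down without changing their direction.
multi_step_model.compile(
    optimizer=tf.keras.optimizers.RMSprop(clipnorm=1.0),
    loss='mae')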

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange