Question

I'm reposting this question from AI.SE here as I think it was maybe off-topic for AI.SE...

1. Context

I'm studying Health-Monitoring techniques, and I practice on the C-MAPSS dataset. The goal is to predict the Remaining Useful Life (RUL) of an engine given series of sensor measurements. There is a wide literature about the C-MAPSS dataset, covering both classical (non-DL) ML techniques and DL-based approaches. A few years ago, LSTM-based networks showed promising results (see Long Short-Term Memory Network for Remaining Useful Life Estimation, Zheng et al., 2017), and I'm trying to reproduce these results.

The C-MAPSS dataset contains only a small amount of data. The FD001 subset, for instance, has only 100 run-to-failure series. When I preprocess it to obtain fixed-length time series, I can get up to ~20,000 framed series. In the LSTM article mentioned above, the authors use two hidden LSTM layers with 64 units each and two fully-connected layers with 8 neurons each (~55,000 parameters).
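For the record, the framing mentioned above can be obtained with a simple sliding window over each run-to-failure series. The sketch below is only a minimal illustration of that preprocessing step; the function name, window length and array layout are my own choices, not part of the C-MAPSS distribution:

import numpy as np

def frame_series(sensors, rul, window=30):
    # sensors: (T, n_features) measurements of one run-to-failure series
    # rul: (T,) remaining-useful-life target at each time step
    X, y = [], []
    for end in range(window, len(sensors) + 1):
        X.append(sensors[end - window:end])   # fixed-length window
        y.append(rul[end - 1])                # RUL at the window's last step
    return np.array(X), np.array(y)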

2. Problem

LSTMs involve a large number of parameters, so overfitting is likely when training such a network. Even with L1 or L2 regularization and dropout, the network remains largely oversized relative to the dataset. Keeping the same architecture, I cannot reach the paper's scores and RMSE on the validation set, and overfitting is always present.

However, one thing that works is reducing the number of units in the LSTM layers. As expected, with only 24 units per layer instead of 64, the net has far fewer parameters (~9,000) and shows no overfitting. The scores and RMSE are a bit worse than those in the paper, but it's the best I can get so far. Although these results are fine for me, I'm curious how the authors of the paper managed to avoid overfitting with their LSTM(64, 64) net.

3. Question

LSTMs are great, but they involve a lot of parameters, which hinders learning on a small dataset: I wonder if there is any method to tackle this specific issue. Would you have any advice on how to avoid overfitting with an LSTM-based net on a small dataset?

4. Additional information

Below is more information about my network and results:

Network architecture

from tensorflow import keras

# input_shape = (sequence_length, n_features) of the framed series
model = keras.models.Sequential([
    keras.layers.LSTM(24, return_sequences=True, kernel_regularizer=keras.regularizers.l1(0.01),
                      input_shape=input_shape),
    keras.layers.Dropout(0.2),
    keras.layers.LSTM(24, return_sequences=False, kernel_regularizer=keras.regularizers.l1(0.01)),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(8, activation='relu', kernel_regularizer=keras.regularizers.l2()),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(8, activation='relu', kernel_regularizer=keras.regularizers.l2(), bias_regularizer=keras.regularizers.l2()),
    keras.layers.Dense(1, activation='relu')  # RUL is non-negative, hence the ReLU output
])
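For completeness, a training call along these lines can be used; the optimizer, loss, batch size and number of epochs below are placeholders rather than the exact settings from my experiments:

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
              loss='mse',
              metrics=[keras.metrics.RootMeanSquaredError()])
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=100, batch_size=64)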

Scores (Validation set)

  • Paper: RMSE = 16.14; Score = 338
  • My LSTM(64, 64): RMSE = 26.47; Score = 3585 (overfits)
  • My LSTM(24, 24): RMSE = 16.82; Score = 515

Edit: Results for the solution proposed by @hH1sG0n3

  • LSTM(64, 64) with recurrent_dropout=0.3: RMSE = 16.36; Score = 545

Solution

You may want to check a couple of hyperparameters that you do not appear to be tuning in the code above:

  • Gradient clipping: excessively large gradient updates during training can cause numerical overflow or underflow, a problem often referred to as "exploding gradients". Clipping the gradient norm keeps the weight updates bounded.
from tensorflow.keras.optimizers import SGD

# configure your optimizer with gradient norm clipping
opt = SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0)
  • Recurrent dropout: dropout applied to the recurrent input signal of the units of the LSTM layer.
keras.layers.LSTM(24, kernel_regularizer=keras.regularizers.l1(0.01), ..., recurrent_dropout=0.3)
  • Stateful: it is not clear from the paper whether the model retains its state across batches during training. You can experiment with this as well (see the note after this list).
keras.layers.LSTM(24, kernel_regularizer=keras.regularizers.l1(0.01), ..., stateful=True)
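Note on the stateful option: in Keras, a stateful LSTM needs a fixed batch size, so the first layer takes batch_input_shape instead of input_shape. A minimal sketch, where the batch size of 32 and the names sequence_length / n_features are placeholders for your own framing:

keras.layers.LSTM(24, stateful=True,
                  batch_input_shape=(32, sequence_length, n_features))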

OTHER TIPS

So, the question asks how to prevent overfitting, particularly with a small dataset.

My first intuition is to reduce the number of layers (e.g. remove the second LSTM layer), but this would change the overall architecture of the model so that it has fewer layers than the model described in the paper; a sketch of this variant follows below.
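A sketch of that reduced variant, reusing the layer sizes from the code in the question and simply dropping the second LSTM layer:

model = keras.models.Sequential([
    keras.layers.LSTM(24, return_sequences=False, kernel_regularizer=keras.regularizers.l1(0.01),
                      input_shape=input_shape),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(8, activation='relu', kernel_regularizer=keras.regularizers.l2()),
    keras.layers.Dense(1, activation='relu')
])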

Another fairly obvious suggestion is some form of data augmentation, to artificially increase the number of samples from the dataset you currently have; a sketch of one simple option follows below.
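One simple technique for sensor time series is jittering, i.e. adding small Gaussian noise to copies of the framed windows. The sketch below is only illustrative; the noise level and number of copies are arbitrary and not validated on C-MAPSS:

import numpy as np

def jitter(X, y, sigma=0.01, copies=2, seed=0):
    # X: (n_windows, window, n_features) framed series, y: (n_windows,) RUL targets
    rng = np.random.default_rng(seed)
    X_aug = [X] + [X + rng.normal(0.0, sigma, size=X.shape) for _ in range(copies)]
    y_aug = [y] * (copies + 1)
    return np.concatenate(X_aug), np.concatenate(y_aug)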

Have you also applied any preprocessing to the data (e.g. scaling the numerical values)? If not, this could also help.
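A typical choice is per-sensor standardisation fitted on the training set only and then applied unchanged to the validation/test data. A minimal sketch with scikit-learn, where the array names are placeholders:

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training sensors only (shape: (n_samples, n_features))
scaler = StandardScaler().fit(train_sensors)
train_scaled = scaler.transform(train_sensors)
val_scaled = scaler.transform(val_sensors)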

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange