Question

I am working on LSTMs and LSTM AutoEncoders, trying different types of architectures for multivariate time series data, using Keras.

Since it is not really practical to use relu as the LSTM activation because of exploding gradients, I added a Dense layer after the LSTM, so the model looks like:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(number_of_features, batch_input_shape = (batch_size, time_steps, number_of_features), return_sequences = True))
model.add(Dense(number_of_features))

What I want to know is:

  • Is this fully connected Dense layer connected only to the last time step of the LSTM, or is it applied to every time step?

To be sure about this, I checked the number of parameters. The Dense layer has number_of_features $\times$ (number_of_features + 1) parameters, which I took to imply that it is applied to all time steps of the LSTM output. That would make sense since I set return_sequences = True, but even when I set it to False the parameter count did not change, which made me doubt my understanding.
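
For reference, here is a minimal sketch of that check (the concrete values for batch_size, time_steps and number_of_features are arbitrary placeholders, not the ones from my real data):

from keras.models import Sequential
from keras.layers import LSTM, Dense

batch_size, time_steps, number_of_features = 8, 10, 5

model = Sequential()
model.add(LSTM(number_of_features,
               batch_input_shape=(batch_size, time_steps, number_of_features),
               return_sequences=True))   # flipping this to False gives the same Dense parameter count
model.add(Dense(number_of_features))

# model.summary() reports number_of_features * (number_of_features + 1) = 5 * 6 = 30
# parameters for the Dense layer in both cases.
model.summary()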

So,

  • How does a Dense layer behave after an LSTM with return_sequences = True?
  • How is it different from a TimeDistributed layer?
  • Why did changing return_sequences to False not reduce the number of parameters of the Dense layer from number_of_features $\times$ (number_of_features + 1) to (number_of_features + 1)?

Solution

I was able to find the explanation in Tensorflow Warrior's answer here.

  • In Keras, when an LSTM(return_sequences = True) layer is followed by a Dense() layer, this is equivalent to LSTM(return_sequences = True) followed by TimeDistributed(Dense()).
  • When return_sequences is set to False, the Dense layer is applied to the last time step only.
  • The number of parameters stayed the same when I set return_sequences = False because, even when the Dense layer is applied to all time steps, the same weights are shared across them; that is, after all, exactly what TimeDistributed() does. A quick way to verify this is sketched below.
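
As a quick sanity check (a minimal sketch with arbitrary placeholder sizes, not part of the original answer), the two variants can be built side by side and compared:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

batch_size, time_steps, number_of_features = 8, 10, 5

def build(head):
    # Same LSTM backbone, different output head.
    m = Sequential()
    m.add(LSTM(number_of_features,
               batch_input_shape=(batch_size, time_steps, number_of_features),
               return_sequences=True))
    m.add(head)
    return m

dense_model = build(Dense(number_of_features))
td_model = build(TimeDistributed(Dense(number_of_features)))

# Both heads produce (batch_size, time_steps, number_of_features) outputs and
# both have number_of_features * (number_of_features + 1) = 30 parameters.
print(dense_model.output_shape, dense_model.layers[-1].count_params())
print(td_model.output_shape, td_model.layers[-1].count_params())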
Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange