Question

I need to use an encoder-decoder structure to predict 2D trajectories. As almost all available tutorials are related to NLP (with sparse vectors), I couldn't be sure how to adapt the solutions to continuous data.

In addition to my ignorance of sequence-to-sequence models, the embedding process for words confused me even more. I have a dataset of 3,000,000 samples, each consisting of 125 observations of x-y coordinates in (-1, 1), which means the shape of each sample is (125, 2). I thought I could treat this as 125 already-embedded 2-dimensional words, but the encoder and the decoder in this Keras tutorial expect 3D arrays of shape (num_pairs, max_english_sentence_length, num_english_characters).

I am not sure whether I need to train on each (125, 2) sample separately with this model, the way Google's search bar makes suggestions from only one typed word.

As far as I understood, an encoder is a many-to-one type of model and a decoder is a one-to-many type of model. I need to get a memory state c and a hidden state h as vectors(?). Then I should use those vectors as input to the decoder and extract as many (x, y) predictions from the decoder output as I determine.

I'd be so thankful if someone could give an example of an encoder-decoder LSTM architecture over the shape of my dataset, especially in terms of the dimensions required for the encoder and decoder inputs and outputs, particularly as a Keras model if possible.


Solution

There are multiple questions in your description:

  • The inputs to LSTMs are normally continuous representations. In NLP, you normally embed discrete elements as vectors in a continuous representation space and then pass these vectors to an LSTM. You already have continuous representations, so you just pass your 2-dimensional vectors as input to the LSTM.
  • In NLP neural networks, as in most neural networks, training is done by passing as input not one sample but a minibatch of N samples. This N is usually chosen as the largest number for which the model, data, and intermediate computations still fit in GPU memory.
  • The 3D array expected as input by the LSTM has shape [N, timesteps, features]. In your case, this would be [N, 125, 2].
  • I think you don't need an encoder-decoder architecture, only the encoder. Therefore, a single LSTM would suffice. You would train it to receive any number of input elements and predict the next one. If you want more predictions ahead, you can feed the model's own predictions back in as input, autoregressively (a minimal sketch follows this list). To find an analogy in the NLP world, your model would be a language model, which receives words (or letters) and generates the following word.
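
For concreteness, here is a minimal sketch of that last point in Keras: a single LSTM trained to predict the next (x, y) point, then rolled out autoregressively. The random toy data, the 64 LSTM units, the training settings, and the 100-point seed with a 25-step rollout are illustrative assumptions, not values taken from your problem:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Toy stand-in for the real dataset: N trajectories of 125 (x, y) points in (-1, 1).
N = 1000
data = np.random.uniform(-1.0, 1.0, size=(N, 125, 2)).astype("float32")

# Next-step prediction: inputs are points 0..123, targets are points 1..124.
X = data[:, :-1, :]   # shape (N, 124, 2)
y = data[:, 1:, :]    # shape (N, 124, 2)

model = keras.Sequential([
    keras.Input(shape=(None, 2)),            # any number of timesteps, 2 features
    layers.LSTM(64, return_sequences=True),  # one hidden vector per timestep
    layers.Dense(2),                         # an (x, y) prediction at every step
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, batch_size=64, epochs=10)

# Autoregressive rollout: feed the model's own predictions back in
# to forecast points beyond the observed part of a trajectory.
trajectory = data[:1, :100, :]               # first 100 observed points of one sample
for _ in range(25):
    next_point = model.predict(trajectory, verbose=0)[:, -1:, :]  # prediction for the next step
    trajectory = np.concatenate([trajectory, next_point], axis=1)
```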

P.S.: I may be wrong about you needing only the encoder instead of an encoder-decoder architecture, as I don't know the specific nature of the predictions you want to make.
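
And if it turns out that you do need a full encoder-decoder after all, a minimal sketch of the Keras tutorial's architecture adapted to continuous 2D data could look like the following. Here latent_dim = 64 and the teacher-forcing setup (feeding the decoder the target sequence shifted by one step) are again illustrative assumptions; the only real changes with respect to the NLP tutorial are the 2-dimensional inputs and a regression head with an MSE loss instead of a softmax:

```python
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 64  # size of the LSTM state vectors h and c (illustrative)

# Encoder: reads the observed part of the trajectory and keeps only its final states.
encoder_inputs = keras.Input(shape=(None, 2))                  # (batch, T_in, 2)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: starts from the encoder's states and emits one (x, y) per timestep.
decoder_inputs = keras.Input(shape=(None, 2))                  # (batch, T_out, 2), teacher-forced
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = layers.Dense(2)(decoder_outputs)             # regression head instead of softmax

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="mse")
# model.fit([encoder_in, decoder_in], decoder_targets, ...), where decoder_in is the
# target sequence shifted one step back (teacher forcing), just as in the NLP tutorial.
```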

Licensed under: CC-BY-SA with attribution