In sequence models, is it possible to have training batches with different timesteps each to reduce the required padding per input sequence?

datascience.stackexchange https://datascience.stackexchange.com/questions/85966

Question

I want to train an LSTM model on variable-length inputs. Specifically, I want to use as little padding as possible while still using minibatches.

As far as I understand, each batch requires a fixed number of timesteps for all of its inputs, which necessitates padding. However, different batches can have different numbers of timesteps, so within each batch the inputs only have to be padded to the length of the longest input sequence in that same batch. This is what I want to implement.

What I need to do:

  1. Dynamically create batches of a given size during training, with the inputs in each batch padded to the longest sequence in that same batch.
  2. Shuffle the training data after each epoch, so that inputs appear in different batches across epochs and are padded differently.

Sadly, my googling skills have failed me entirely. I can only find examples and resources on how to pad the entire input set to a fixed length, which is what I have been doing already and want to move away from. Some clues point me towards TensorFlow's Dataset API, yet I can't find examples of how and why it would apply to the problem I am facing.

I'd appreciate any pointers to resources and ideally examples and tutorials on what I am trying to accomplish.


Solution

The answer to your needs is called "bucketing". It consists of creating batches of sequences of similar lengths, to minimize the needed padding.

In TensorFlow, you can do it with tf.data.experimental.bucket_by_sequence_length. Take into account that it previously lived in a different Python package (tf.contrib.data.bucket_by_sequence_length), so examples online may contain the outdated name.

To see some usage examples, you can check this Jupyter notebook, other answers on Stack Overflow, or this tutorial.
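For a concrete picture, here is a minimal sketch of bucketing with the tf.data API. The toy data, bucket boundaries, and batch sizes below are made up purely for illustration; the point is that each batch is padded only to the longest sequence in its own bucket.

import tensorflow as tf

# Toy variable-length integer sequences with scalar labels (illustrative only).
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11], [12, 13, 14, 15]]
labels = [0.0, 1.0, 0.0, 1.0, 1.0]

def generator():
    for seq, label in zip(sequences, labels):
        yield seq, label

dataset = tf.data.Dataset.from_generator(
    generator,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int32),
        tf.TensorSpec(shape=(), dtype=tf.float32),
    ),
)

# Group sequences into length buckets and batch within each bucket, so every
# batch is only padded to the length of its longest member.
bucketed = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda seq, label: tf.shape(seq)[0],
        bucket_boundaries=[3, 5],      # buckets: <3, 3-4, >=5 timesteps
        bucket_batch_sizes=[2, 2, 2],  # one batch size per bucket
    )
)

for batch_x, batch_y in bucketed:
    print(batch_x.shape, batch_y.shape)  # the time dimension varies per batch

In recent TensorFlow versions the same transformation is also available directly as a Dataset method, bucket_by_sequence_length, without the experimental namespace.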

OTHER TIPS

I found a solution: pass a custom batch generator of type keras.utils.Sequence to the model.fit function (where one can write any logic to construct batches and to modify/augment the training data) instead of passing the entire dataset in one go. Relevant code for reference:

import numpy as np
from tensorflow import keras

# Must implement the __len__ function, returning the number
# of batches in this dataset, and the __getitem__ function,
# which returns a tuple (inputs, labels).
# Optionally, on_epoch_end() can be implemented, which as the
# name suggests is called at the end of each epoch. Here one
# can e.g. shuffle the input data for the next epoch.

class BatchGenerator(keras.utils.Sequence):

    def __init__(self, inputs, labels, padding, batch_size):
        self.inputs = inputs
        self.labels = labels
        self.padding = padding
        self.batch_size = batch_size

    def __len__(self):
        return int(np.floor(len(self.inputs) / self.batch_size))

    def __getitem__(self, index):
        # Find the longest sequence in this batch.
        start_index = index * self.batch_size
        end_index = start_index + self.batch_size
        max_length = max(len(self.inputs[i]) for i in range(start_index, end_index))

        # Pad every sequence in the batch to that length.
        out_x = np.empty([self.batch_size, max_length], dtype='int32')
        out_y = np.empty([self.batch_size, 1], dtype='float32')
        for i in range(self.batch_size):
            out_y[i] = self.labels[start_index + i]
            tweet = self.inputs[start_index + i]
            length = len(tweet)
            out_x[i, :length] = tweet
            out_x[i, length:] = self.padding
        return out_x, out_y


# The model.fit function can then be called like this:

training_generator = BatchGenerator(tokens_train, y_train, pad, batch_size)
model.fit(training_generator, epochs=epochs)
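The generator above does not yet reshuffle the data between epochs (requirement 2). A minimal sketch of how that could look follows; ShufflingBatchGenerator is a hypothetical extension of the BatchGenerator class above, not part of the original answer, and it assumes inputs and labels are plain Python lists. Keras calls on_epoch_end automatically after every epoch when fitting on a keras.utils.Sequence.

import numpy as np

class ShufflingBatchGenerator(BatchGenerator):

    def on_epoch_end(self):
        # Apply the same random permutation to inputs and labels so that
        # sequences land in different batches (and are padded differently)
        # in the next epoch.
        permutation = np.random.permutation(len(self.inputs))
        self.inputs = [self.inputs[i] for i in permutation]
        self.labels = [self.labels[i] for i in permutation]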
Licensed under: CC-BY-SA with attribution