Question

When working with very large models in deep learning, training often takes a long time and requires small batch sizes due to memory restrictions.
Usually, we are left with a model checkpoint after training has concluded. I am wondering whether the exact time at which we take that checkpoint significantly factors into the statistical properties of the model's outputs.

For example:
Within text generation, let's assume that just before we extract the checkpoint, the model is trained on statistically anomalous batches whose sentences are longer than the dataset mean.
Would that result in our model generating longer sentences, overrepresenting that recent batch of anomalous texts?

As training batches are usually sampled randomly from the dataset, such unrepresentative batches can certainly occur, sometimes right before we save the checkpoint.
Has there been any research into this kind of potentially unwanted recency bias in slow, large-scale deep learning scenarios?

The only references I could find intentionally try to exploit such biases; I have not found any literature on unwanted recency bias.


Solution

Your question is very interesting; however, I feel you are overlooking a key point in your reasoning:

You usually take a model checkpoint at the point where it performs best on the validation set. This means that the instance of the model you keep is the most robust and generalizable version you have evaluated, and therefore the one that suffers least from recency bias.
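Below is a minimal sketch of what "keep the best checkpoint" looks like in practice, assuming a PyTorch-style setup; `model`, `optimizer`, `loss_fn`, `train_loader`, and `val_loader` are placeholders rather than anything from your question:

```python
import copy
import torch

def train_with_best_checkpoint(model, optimizer, loss_fn,
                               train_loader, val_loader, epochs=10):
    """Train and return the weights that scored best on validation data."""
    best_val_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

        # Evaluate on held-out data; the checkpoint we keep is the one
        # that generalizes best, not simply the most recent one.
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for x, y in val_loader:
                val_loss += loss_fn(model(x), y).item()
        val_loss /= len(val_loader)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())

    # Restore the best-performing weights before returning.
    model.load_state_dict(best_state)
    return model, best_val_loss
```

Because the saved state is selected by held-out performance rather than by its position in training time, an anomalous final batch that hurts generalization simply never becomes the checkpoint.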

Suppose, though, that you don't checkpoint the model but instead stop training at an arbitrary point. Naturally, you'd expect the samples in the final batch to have influenced the current state of the model more than the first batches of the epoch: each SGD update moves the weights by the learning rate times the gradient of the most recent batch, so recent batches leave the freshest imprint. In practice, however, this tends to show up as ordinary overfitting rather than a distinct recency bias.

Some ways to deal with this (a short sketch of two of them follows the list):

  • a relatively small learning rate
  • regularization via parameter norm penalties (e.g. L1, L2, ...)
  • ensembling
  • other, more specialized techniques such as SGDA
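As a rough, hedged sketch of two of the items above (assuming PyTorch; the hyperparameter values are illustrative only, not recommendations):

```python
import copy
import torch

def make_optimizer(model):
    # A small learning rate plus an L2 penalty (weight_decay) limits how
    # far any single batch, including the last one, can move the weights.
    return torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

def average_state_dicts(state_dicts):
    # Uniformly average several saved checkpoints (a crude form of
    # ensembling / weight averaging), so that no single recent batch
    # dominates the final parameters. Assumes identical architectures.
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        if avg[key].dtype.is_floating_point:
            avg[key] = sum(sd[key] for sd in state_dicts) / len(state_dicts)
    return avg
```

You would load the averaged weights back with `model.load_state_dict(...)`; this averaging is in the same spirit as the weight-averaging techniques the last bullet alludes to.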