Question

The parameters of the network are changed to minimize the loss on the mini-batch, but usually the loss on the mini-batch is just the (weighted) sum of losses on each datum individually. Loosely, I would represent this as $$ dT = \frac{1}{\text{batch\_size}} \sum_{i \in \text{batch}} dT_i$$

Where $dT$ is the update of the net parameters for the whole batch and $dT_i$ is the update for a single training example. Why can't $dT$ be computed 'on-line' then, so that the only memory needed is for the partial sum of $dT$ and for whichever $dT_i$ is being computed at that moment?
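
In other words, keeping a running partial sum $dT^{(k)}$ over the examples seen so far would give $$ dT^{(k)} = dT^{(k-1)} + \frac{1}{\text{batch\_size}}\, dT_k, \qquad dT^{(0)} = 0,$$ so only $dT^{(k)}$ and the current $dT_k$ would need to be held in memory at once.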


Solution

Something similar to what you describe is frequently used in some domains, and it is called gradient accumulation. In layman's terms, it consists of computing the gradients for several batches without updating the weights and, after N batches, aggregating the gradients and applying the weight update.
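
As a rough sketch (assuming PyTorch; the toy model, data, and `accum_steps` below are made up for illustration), gradient accumulation looks like this: each micro-batch loss is scaled by the number of accumulation steps so that the summed gradients match the full-batch average, and the optimizer step is only taken after the last micro-batch.

```python
import torch
from torch import nn

# Toy setup (illustrative names, not from the original post).
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x_all = torch.randn(64, 10)   # pretend these 64 samples don't fit in memory at once
y_all = torch.randn(64, 1)

accum_steps = 8               # number of micro-batches per weight update
x_chunks = torch.chunk(x_all, accum_steps)
y_chunks = torch.chunk(y_all, accum_steps)

optimizer.zero_grad()
for i, (xb, yb) in enumerate(zip(x_chunks, y_chunks)):
    # Scale each loss so the accumulated gradient equals the full-batch mean.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()           # .grad buffers keep the running partial sum
    if (i + 1) % accum_steps == 0:
        optimizer.step()      # apply the aggregated update after N micro-batches
        optimizer.zero_grad() # reset the accumulated gradients
```

Because `.grad` accumulates across `backward()` calls, only one micro-batch's activations are in memory at a time, which is essentially the 'on-line' partial sum described in the question.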

This certainly allows using effective batch sizes larger than what fits in GPU RAM at once.

The limitation of this is that at least one training sample must fit in GPU memory. If that is not the case, other techniques like gradient checkpointing can be used.
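
For reference, here is a minimal sketch of gradient checkpointing (assuming PyTorch's `torch.utils.checkpoint`; the block and tensor sizes are illustrative). Activations inside the wrapped block are recomputed during the backward pass instead of being stored, trading extra compute for lower peak memory.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Illustrative deep block whose intermediate activations we avoid storing.
block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(4, 512, requires_grad=True)

# The forward pass does not cache the block's intermediate activations;
# they are recomputed when backward() reaches this segment.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```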
