The parameters of the network are changed to minimize the loss on the mini-batch, but usually the loss on the mini-batch is just the (weighted) sum of losses on each datum individually. Loosely, I would represent this as $$ dT = \frac{1}{\text{batch\_size}} \sum_{i \in \text{batch}} dT_i$$

where $dT$ is the update of the net parameters for the batch and $dT_i$ is the update for a single training example. Why can't $dT$ be calculated 'on-line' then, where the only RAM needed is for the partial sum of $dT$ and whichever $dT_i$ you are working on at that moment?
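
For concreteness, the 'on-line' computation I have in mind is just a running sum (notation here is illustrative, not from any particular framework):

$$ S_0 = 0, \qquad S_k = S_{k-1} + dT_{i_k}, \qquad dT = \frac{S_{|\text{batch}|}}{|\text{batch}|} $$

so at any moment only $S_k$ and the current $dT_{i_k}$ need to be held in memory.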


Solution

Something similar to what you describe is frequently used in some domains and is called gradient accumulation. In layman's terms, it consists of computing the gradients for several batches without updating the weights; after N batches, you aggregate the gradients and apply the weight update, as sketched below.
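
Here is a minimal sketch of gradient accumulation, assuming PyTorch; the tiny linear model, random data, and the name `accumulation_steps` (the N above) are placeholders for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

accumulation_steps = 4  # N micro-batches per weight update
# dummy micro-batches standing in for a real data loader
micro_batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches):
    outputs = model(inputs)
    # Scale by N so the accumulated gradient equals the average over the
    # accumulation window rather than the sum.
    loss = loss_fn(outputs, targets) / accumulation_steps
    loss.backward()  # gradients add up in the parameters' .grad buffers

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # apply the accumulated update
        optimizer.zero_grad()  # reset for the next accumulation window
```

The effective batch size is then `accumulation_steps` times the micro-batch size, while only one micro-batch of activations has to fit in memory at a time.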

This allows using effective batch sizes larger than what fits in GPU RAM.

The limitation is that at least one training sample must fit in GPU memory. If even that is not the case, other techniques such as gradient checkpointing can be used.
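
A minimal sketch of gradient checkpointing, again assuming PyTorch's `torch.utils.checkpoint`; the small model and random input are placeholders. The idea is to trade compute for memory: activations inside the checkpointed block are not stored during the forward pass but recomputed during the backward pass:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 10))
head = nn.Linear(10, 1)

x = torch.randn(8, 10)

# Intermediate activations inside `block` are recomputed on the backward
# pass instead of being kept in memory after the forward pass.
# use_reentrant=False selects the non-reentrant variant recommended in
# recent PyTorch versions.
hidden = checkpoint(block, x, use_reentrant=False)
out = head(hidden)
out.sum().backward()
```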

Licensed under: CC BY-SA with attribution