Question

Is there a minimum batch size for training/fine-tuning a BERT model on custom data?

Could you name any cases where a mini-batch size between 1 and 8 would make sense?

Would a batch size of 1 make sense at all?


Solution

A small mini-batch size leads to high variance in the gradient estimates. In theory, with a sufficiently small learning rate, you can learn anything even with very small batches.

In practice, Transformers are known to work best with very large batches. You can simulate a large batch by accumulating gradients over several mini-batches and only applying the optimizer update once every few steps.
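
A minimal sketch of gradient accumulation in PyTorch; the tiny linear model and random tensors below are stand-ins, not part of the original answer, but the same loop structure applies to a Hugging Face BERT model whose forward pass returns a loss.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # stand-in for a BERT classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 8                                # 8 mini-batches of size 4 ~ effective batch of 32
optimizer.zero_grad()
for step in range(64):
    x = torch.randn(4, 10)                     # mini-batch of size 4 (dummy data)
    y = torch.randint(0, 2, (4,))
    loss = loss_fn(model(x), y) / accum_steps  # scale so accumulated grads match one big batch
    loss.backward()                            # gradients add up in .grad across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()                       # one update per accum_steps mini-batches
        optimizer.zero_grad()
```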

Also, when fine-tuning BERT, you might consider updating only the last layer (or the last several layers), so you save memory on the parameter gradients and can afford bigger batches.
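
A sketch of one way to do this, assuming the Hugging Face transformers library: freeze all parameters, then unfreeze only the classification head and the last two encoder layers. Frozen parameters need no gradient buffers, which frees memory for larger batches.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the classification head and the last two encoder layers.
for param in model.classifier.parameters():
    param.requires_grad = True
for layer in model.bert.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

# Only pass trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```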

License: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange