Question

Is there a minimum batch size for training/fine-tuning a BERT model on custom data?

Could you name any cases where a mini batch size between 1-8 would make sense?

Would a batch size of 1 make sense at all?


Solution

A small mini-batch size leads to high variance in the gradient estimates. In theory, with a sufficiently small learning rate, you can learn anything even with very small batches.

In practice, Transformers are known to work best with very large batches. You can simulate a large batch by accumulating gradients over several mini-batches and performing the parameter update only once every several steps.
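
As a rough illustration, a gradient-accumulation loop in PyTorch could look like the sketch below. The model choice, learning rate, batch size, accumulation factor, and the `train_dataset` variable are assumptions for the example, not part of the original answer.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification

# Hypothetical setup: `train_dataset` yields dicts with input_ids, attention_mask, labels.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

accumulation_steps = 8  # effective batch size = 4 * 8 = 32
optimizer.zero_grad()

for step, batch in enumerate(train_loader):
    outputs = model(**batch)                   # the model returns a loss when labels are provided
    loss = outputs.loss / accumulation_steps   # scale so the accumulated gradient matches one big batch
    loss.backward()                            # gradients add up in .grad across mini-batches

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                       # one weight update per "virtual" large batch
        optimizer.zero_grad()
```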

Also, when fine-tuning BERT, you might consider fine-tuning only the last layer (or the last few layers); this saves memory on the parameter gradients and lets you use larger batches.
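
One way to do this with a Hugging Face `BertForSequenceClassification` is sketched below; freezing all but the last two encoder blocks is an arbitrary choice here, and the attribute paths assume the standard Hugging Face BERT layout.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze everything, then unfreeze only the last two encoder layers, the pooler, and the classifier head.
for param in model.parameters():
    param.requires_grad = False

for layer in model.bert.encoder.layer[-2:]:   # last two Transformer blocks
    for param in layer.parameters():
        param.requires_grad = True
for param in model.bert.pooler.parameters():
    param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

# Only the unfrozen parameters need gradient buffers, which frees memory for larger batches.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```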

Licensed under: CC-BY-SA with attribution