Question

I'm training an LSTM with Keras.

I've noticed that the smaller the batch size, the more the loss decreases over the epochs, which makes me think that the network learns better when it processes fewer items at a time.

Is this normal behavior in general?


Solution

In general, neither a smaller nor a larger batch size guarantees better convergence. Batch size is more or less treated as a hyperparameter to tune, within whatever memory constraints you have.
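
For reference, here is a minimal sketch (with made-up data shapes and sizes, not the asker's actual setup) of where batch size enters a Keras training run; it is just an argument to fit():

    import numpy as np
    from tensorflow import keras

    # Toy stand-ins for the asker's data: 1000 sequences, 20 timesteps, 8 features.
    X = np.random.rand(1000, 20, 8).astype("float32")
    y = np.random.rand(1000, 1).astype("float32")

    model = keras.Sequential([
        keras.Input(shape=(20, 8)),
        keras.layers.LSTM(32),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # batch_size is the hyperparameter in question; 32 is Keras's default.
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)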

The tradeoff is that bigger and smaller batch sizes each come with their own disadvantages, which is what makes batch size something to tune.

In theory, the bigger the batch size, the less noisy the gradients are, and so the better the gradient estimate is. This allows the model to take a better step towards a minimum. However, a bigger batch size needs more memory, and each step is more time consuming.
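
To make the noise claim concrete (this is a standard result, not something from the original answer): the mini-batch gradient is an average of B per-sample gradients, so for i.i.d. samples its variance shrinks linearly in the batch size B,

    \hat{g}_B = \frac{1}{B} \sum_{i=1}^{B} \nabla_\theta\, \ell(x_i, \theta),
    \qquad
    \operatorname{Var}\!\left[\hat{g}_B\right] = \frac{\sigma^2}{B}

where \ell is the per-sample loss and \sigma^2 denotes the variance of a single sample's gradient. Quadrupling the batch size only halves the standard deviation of the noise.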

Even if we could somehow avoid the time and space constraints, a bigger batch size still wouldn't give a better solution in practice compared to a smaller one. This is because the objective surface of a neural network is generally non-convex, which means there may be local optima. An accurate gradient estimate doesn't guarantee that we reach the global optimum (which is what we seek); it could just as well lead us accurately into a local optimum. Keeping the batch size small makes the gradient estimate noisy, which might let us bypass a local optimum during training. On the other hand, a very small batch size makes the gradients too noisy for the model to converge anywhere.
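
As a toy illustration of that last point (a hypothetical one-dimensional example, not from the original answer), consider gradient descent on f(x) = x^4 - 3x^2 + x, which has a shallow local minimum near x ≈ 1.1 and a deeper global minimum near x ≈ -1.3. Adding noise to the gradient, as a small batch would, lets the iterate hop over the barrier between them:

    import numpy as np

    rng = np.random.default_rng(0)

    def grad(x):
        # Derivative of f(x) = x**4 - 3*x**2 + x.
        return 4 * x**3 - 6 * x + 1

    def descend(noise_scale, x=1.1, lr=0.01, steps=5000):
        # Noisy phase: mimics the gradient noise of a small batch.
        for _ in range(steps):
            x -= lr * (grad(x) + noise_scale * rng.normal())
        # Noise-free phase: settle into whichever basin we ended up in.
        for _ in range(1000):
            x -= lr * grad(x)
        return x

    print("exact gradient:", descend(0.0))    # stays at the local minimum, ~1.1
    print("noisy gradient:", descend(10.0))   # usually settles near ~-1.3

The exact-gradient run converges to the minimum it started in; the noisy run usually escapes to the deeper basin, though the exact outcome depends on the random seed.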

So, the optimum batch size depends on the network you are training, the data you are training on, and the objective function you are trying to optimize.
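
In practice that means trying a few values and comparing held-out loss. A sketch of such a sweep, reusing the same made-up data and model as above (substitute your own):

    import numpy as np
    from tensorflow import keras

    X = np.random.rand(1000, 20, 8).astype("float32")
    y = np.random.rand(1000, 1).astype("float32")

    def build_model():
        # A fresh model per run, so the candidates are compared fairly.
        model = keras.Sequential([
            keras.Input(shape=(20, 8)),
            keras.layers.LSTM(32),
            keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        return model

    for batch_size in [16, 32, 64, 128]:
        history = build_model().fit(
            X, y, validation_split=0.2,
            epochs=10, batch_size=batch_size, verbose=0)
        print(batch_size, min(history.history["val_loss"]))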

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange