Question

Say I am training a neural network and can fit all my data into memory. Are there any benefits to using mini batches with SGD in this case? Or is batch training with the full gradient always superior when possible?

Also, it seems like many of the more modern optimization algorithms (RMSProp, Adam, etc.) were designed with SGD in mind. Are these methods still superior to standard gradient descent (with momentum) with the full gradient available?

Solution

On large datasets, SGD can converge faster than full-batch training because it performs parameter updates far more frequently. This works because the data often contains redundant information, so the gradient can be approximated reasonably well from a small sample rather than the full dataset. Minibatch training can also be faster than training on single data points because it takes advantage of vectorized operations to process the entire minibatch at once. Finally, the noise in online/minibatch updates can help the optimizer escape local minima that might otherwise trap full-batch training.
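
As an illustration, here is a minimal NumPy sketch on a hypothetical toy problem (linear regression with squared error; the data, learning rate, and batch size are made up for the example). Both loops see the same in-memory data, but the minibatch version makes many noisier updates per pass while the full-batch version makes one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))          # toy dataset that fits in memory
true_w = rng.normal(size=20)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Gradient of mean squared error on the batch (Xb, yb).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

lr, batch_size = 0.1, 32

# Full-batch gradient descent: one update per pass over the data.
w_full = np.zeros(20)
for epoch in range(50):
    w_full -= lr * grad(w_full, X, y)

# Minibatch SGD: many (noisier) updates per pass over the same data.
w_mini = np.zeros(20)
for epoch in range(50):
    perm = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        w_mini -= lr * grad(w_mini, X[idx], y[idx])

print("full-batch loss:", np.mean((X @ w_full - y) ** 2))
print("minibatch loss: ", np.mean((X @ w_mini - y) ** 2))
```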

One reason to prefer batch training is when the gradient can't be approximated from individual points or minibatches (e.g. when the loss function can't be decomposed as a sum of per-example errors). This isn't an issue for standard classification/regression problems.
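
For intuition, here is a quick NumPy check (reusing a hypothetical linear-regression setup, not from the original answer) that when the loss is a mean of per-example errors, the minibatch gradient is an unbiased estimate of the full-batch gradient, which is what makes the minibatch approximation sound in the first place.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 20))
y = rng.normal(size=1000)
w = rng.normal(size=20)

def grad(w, Xb, yb):
    # Gradient of mean squared error on the batch (Xb, yb).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(w, X, y)

# Average the gradient over many random minibatches of size 32;
# the average should approach the full-batch gradient.
est = np.mean(
    [grad(w, X[idx], y[idx])
     for idx in (rng.choice(1000, size=32, replace=False) for _ in range(5000))],
    axis=0,
)
print("relative error:", np.linalg.norm(est - full) / np.linalg.norm(full))
```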

I don't recall seeing RMSprop/Adam/etc. compared directly to full-batch gradient descent. But, given their potential advantages over vanilla SGD, and the potential advantages of vanilla SGD over full-batch gradient descent, I'd expect them to compare favorably.
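
As a sketch of what such a comparison might look like in practice (hypothetical PyTorch model and synthetic data, chosen only for illustration): Adam runs unchanged whether it is fed minibatches or the entire dataset as one batch; only the batch size differs, so for the same number of epochs the minibatch run makes many more parameter updates.

```python
import torch

torch.manual_seed(0)
X = torch.randn(1000, 20)
y = X @ torch.randn(20, 1) + 0.1 * torch.randn(1000, 1)

def make_model():
    return torch.nn.Sequential(
        torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
    )

def train(batch_size, epochs=20):
    model = make_model()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        perm = torch.randperm(len(X))
        for start in range(0, len(X), batch_size):
            idx = perm[start:start + batch_size]
            opt.zero_grad()
            loss = loss_fn(model(X[idx]), y[idx])
            loss.backward()
            opt.step()
    return loss_fn(model(X), y).item()   # final loss on the full dataset

print("minibatch Adam :", train(batch_size=32))
print("full-batch Adam:", train(batch_size=len(X)))
```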

Of course, we have to keep the no free lunch theorem in mind: there must exist objective functions for which each of these optimization algorithms performs better than the others. But there's no guarantee that those functions belong to the set of practically useful, real-world learning problems.

Licensed under: CC-BY-SA with attribution