Question

Here is the code from the TensorFlow tutorial: A Multilayer Perceptron implementation example

With batch size = 100, we quickly reach an accuracy of 94.59%.

If I set the batch size to 1, training takes about ten times longer, but the accuracy drops to only 9%.

I have tried different learning rates with no luck. SGD performance seems terrible for small batch sizes. We can expect SGD performance to be somewhat lower, but not ten times lower! What is the reason for this loss of accuracy?


Solution

Why is stochastic gradient descent so much worse than batch GD for the MNIST task?

It isn't inherently worse. By changing just one parameter on its own, you have pushed the example outside the regime it was "tuned" for. It is a simplified example for learning purposes, and it is missing some features that most users of NNs would consider standard.

A batch size of 1 is actually performing just fine. Although it takes longer to process the same number of epochs, each epoch contains far more weight updates: you get 100 times as many updates, each one much noisier than an update from a batch of 100. These extra weight updates, plus the extra time spent running interpreted Python code for 100 times as many batches, account for the large increase in training time.
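To make that arithmetic concrete, here is a minimal sketch. The figure of 55,000 training examples is an assumption taken from the standard MNIST split, not from the question itself:

```python
# Updates per epoch for different batch sizes, assuming the standard
# MNIST training split of 55,000 examples (an assumption; the question
# does not state the dataset size).
num_examples = 55_000

for batch_size in (100, 1):
    updates = num_examples // batch_size
    print(f"batch_size={batch_size:>3}: {updates:>6} weight updates per epoch")

# batch_size=100:    550 weight updates per epoch
# batch_size=  1:  55000 weight updates per epoch  (100x as many)
```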

The problem with accuracy is that the example network has no protection against overfitting. With so many more weight updates, training starts to memorise the precise image data for each digit in order to get the best score, and in doing so it learns rules that work very well on the training data but generalise badly to new data in the test set.

Try a batch size of 1 with the number of epochs set to 3 (I tried this and got an accuracy of 94.32%). That is essentially using early stopping as a form of regularisation. It is not the best form of regularisation, but it is quick to try and often effective. The difficulty is knowing when to stop, so you need to measure performance on a held-out set (often separate from the final test set, called a cross-validation set) at each potential stopping point, and save the best model so far. That will obviously involve adjusting the example code; a sketch of the idea follows.
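Here is a minimal, self-contained early-stopping loop. It uses a tiny synthetic softmax-regression problem in NumPy as a stand-in for the tutorial's MNIST network; the data, model, and learning rate are all illustrative assumptions, not the tutorial's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic stand-in for MNIST so the sketch is self-contained:
# 20 features, 10 classes, softmax regression trained with batch size 1.
n_feat, n_cls = 20, 10
W_true = rng.normal(size=(n_feat, n_cls))
X = rng.normal(size=(1200, n_feat))
y = (X @ W_true).argmax(axis=1)                   # synthetic labels
X_tr, y_tr = X[:1000], y[:1000]                   # training split
X_val, y_val = X[1000:], y[1000:]                 # held-out validation split

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def accuracy(W, X, y):
    return float(((X @ W).argmax(axis=1) == y).mean())

W = np.zeros((n_feat, n_cls))
lr, best_acc, best_W = 0.01, 0.0, W.copy()

for epoch in range(30):
    for i in rng.permutation(len(X_tr)):          # batch size 1: one update per example
        p = softmax(X_tr[i] @ W)
        p[y_tr[i]] -= 1.0                         # d(cross-entropy)/d(logits)
        W -= lr * np.outer(X_tr[i], p)            # SGD step
    val_acc = accuracy(W, X_val, y_val)
    if val_acc > best_acc:                        # save the best model so far
        best_acc, best_W = val_acc, W.copy()

# best_W holds the early-stopped weights: the snapshot from the epoch with
# the highest validation accuracy, not the possibly overfitted final weights.
print(f"best validation accuracy: {best_acc:.3f}")
```

The same pattern transfers directly to the tutorial: evaluate on a validation split after each epoch and keep the weights from the best epoch rather than the last one.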

Probably the 15 epochs in the original example were chosen carefully so that overfitting is not a problem with a batch size of 100, but as soon as you change the batch size, without any other form of regularisation, the network is very likely to overfit. In general, neural networks have a strong tendency to overfit, and you have to spend time and effort to understand and defend against this.

Have a look at regularisation in TensorFlow for other options. For this kind of problem, dropout (explained further down the page in that link) is highly recommended. It is not purely a regulariser, but it works to improve generalisation for many neural network problems.
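As a sketch of what that looks like, here is the tutorial's two-hidden-layer MLP with dropout added, written against the Keras API rather than the low-level TF ops the original tutorial uses. The 256-unit layer sizes follow the tutorial; the dropout rate of 0.5 and the SGD learningate of 0.01 are illustrative assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                 # flattened 28x28 MNIST image
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),                 # zero 50% of activations during training
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),                    # logits for the 10 digit classes
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```

Dropout is only active during training; at evaluation time Keras disables it automatically, so the same model can be scored on the test set unchanged.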
