Question

My data set has shape (60, 784, 1000) for the mini-batched inputs and (60, 10, 1000) for the labels. Should I shuffle only the 60 mini batches, or the training examples themselves?


Solution

Normally, you would shuffle up all of the examples and then portion them off into batches of some chosen size. Then, you would do a parameter update based on the gradient of the loss function with respect to each batch. This whole process is one "epoch" of training. Typically, deep neural nets are then trained over many epochs, often with a learning rate that varies as training proceeds.

An important aspect of this process is that when the data is reshuffled at the beginning of each epoch, every example lands in a batch with different companions than it had in the previous epoch. This gives a more complete sampling of batch gradients and improves the stochastic estimate of the true gradient (the derivative of the cost function with respect to the model parameters, computed over the full dataset).
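To make that concrete, here is a minimal NumPy sketch of per-epoch shuffling followed by batching. The array names, shapes, and batch size are illustrative placeholders, not taken from your setup.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    # Draw a fresh permutation of example indices each epoch, so every
    # batch gets a different mix of examples than it had last epoch.
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.random((60_000, 784))                       # placeholder inputs
y = np.eye(10)[rng.integers(0, 10, size=60_000)]    # placeholder one-hot labels

for epoch in range(5):
    for X_batch, y_batch in iterate_minibatches(X, y, batch_size=1000, rng=rng):
        pass  # compute the batch gradient and update the parameters here
```

Because the permutation is redrawn inside every epoch, the batch compositions change from epoch to epoch, which is exactly what static, pre-built batches would not give you.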

Short answer: your model's performance will almost certainly be worse if you keep static batches and only shuffle the order of those batches, instead of shuffling the data itself each epoch and then dividing it into batches.

Also, be careful with your shapes. With a dataset shape of (60, 784, 1000), it's highly likely that you're working on MNIST or one of its cousins: MNIST is 60,000 examples, each of length 784 if the 28 x 28 pixel images have been flattened. Seeing 784 on axis 1 of your shape tuple makes me concerned that you've reshaped your data incorrectly. Make sure that the array entries that represent a single image are where you think they are.
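One quick way to check, sketched below under the assumption that this is flattened MNIST (the file name and the axis interpretation are hypothetical): slice out what should be a single image, reshape it to 28 x 28, and look at it.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical: load the (60, 784, 1000) array described in the question.
data = np.load("mnist_batches.npy")

# If axis 1 really holds the 784 pixels of one image and axis 2 indexes
# the 1000 examples in a batch, then this slice is a single image.
candidate = data[0, :, 0].reshape(28, 28)

plt.imshow(candidate, cmap="gray")
plt.show()
# A recognizable digit means the layout is what you think it is; noise or
# stripes mean the reshape that produced (60, 784, 1000) scrambled pixels
# and examples, and you should reshape from the original (60000, 784)
# array instead, e.g. flat.reshape(60, 1000, 784), so each image stays intact.
```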

Licensed under: CC-BY-SA with attribution