Question

[I've cross-posted it to Cross Validated because I'm not sure where it fits best]

How does gradient descent work for training a neural network when I use mini-batches (i.e., sample a subset of the training set)? I have thought of three different possibilities:

1. Epoch starts. We sample and feedforward one mini-batch only, get the error and backprop it, i.e. update the weights. Epoch over.

2. Epoch starts. We sample and feedforward a mini-batch, get the error and backprop it, i.e. update the weights. We repeat this until we have sampled the full data set. Epoch over.

3. Epoch starts. We sample and feedforward a mini-batch, get the error and store it. We repeat this until we have sampled the full data set. We somehow average the errors and backprop them by updating the weights. Epoch over.


Solution

Let us say that the output of a neural network given its parameters is $$f(x;w)$$ Define the loss function as the squared L2 loss (in this case): $$L(X,y;w) = \frac{1}{2n}\sum_{i=1}^{n}[f(X_i;w)-y_i]^2$$ Here the batch size is denoted by $n$. Essentially, this means that we iterate over a finite subset of samples, with the size of the subset equal to your batch size, and use the gradient normalized over this batch. We do this until we have exhausted every data point in the dataset; then the epoch is over. The gradient in this case is $$\frac{\partial L(X,y;w)}{\partial w} = \frac{1}{n}\sum_{i=1}^{n}[f(X_i;w)-y_i]\frac{\partial f(X_i;w)}{\partial w}$$ Using mini-batch gradient descent normalizes your gradient over the batch, so the updates are not as erratic as they would be with single-sample stochastic gradient descent.
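To make this concrete, here is a minimal NumPy sketch of one epoch of mini-batch gradient descent with the squared L2 loss above. The linear model `f(x; w) = x @ w`, the learning rate, and the batch size are illustrative assumptions, not something specified in the answer.

```python
import numpy as np

def minibatch_sgd_epoch(X, y, w, batch_size=32, lr=0.01):
    """One epoch of mini-batch gradient descent on a linear model f(x; w) = x @ w
    with the loss L(X, y; w) = 1/(2n) * sum_i (f(X_i; w) - y_i)^2."""
    n_samples = X.shape[0]
    order = np.random.permutation(n_samples)     # shuffle; sample without replacement
    for start in range(0, n_samples, batch_size):
        batch = order[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        n = len(batch)                           # the batch size n from the formula
        residual = Xb @ w - yb                   # f(X_i; w) - y_i
        grad = Xb.T @ residual / n               # (1/n) * sum_i residual_i * df/dw
        w = w - lr * grad                        # weight update after this mini-batch
    return w                                     # epoch over once all samples are used
```

Each pass through the loop body corresponds to the second option in the question: the weights are updated after every mini-batch, and the epoch ends once every sample has been visited.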

OTHER TIPS

When you train with mini-batches, you are doing the second option: the network is updated after each mini-batch, and the epoch ends after all samples have been presented.
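This is the standard nested loop that deep learning frameworks implement. A minimal PyTorch-style sketch, where the toy data, model, and hyperparameters are placeholders rather than anything prescribed by the answers:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and model; shapes and hyperparameters are illustrative only.
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:                 # one mini-batch at a time
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)     # error on this mini-batch
        loss.backward()                   # backprop it
        optimizer.step()                  # update the weights
    # the epoch ends after the DataLoader has presented all samples
```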

Please see these responses

Licensed under: CC-BY-SA with attribution