Question

In a blog I read this:
With Stochastic Gradient Descent we don’t compute the exact derivative of our loss function. Instead, we’re estimating it on a small batch.
Now I am confused about the whole concept.
Why do we take an estimate of the derivative? Please explain.


Solution

That's because the whole loss is $\frac{1}{N} \sum\limits_{i=1}^N L(x_i, y_i)$, where $N$ is the dataset size, which can be very large. Computing the true gradient over all $N$ examples at every step is too slow, so instead we compute an unbiased estimate of it via Monte Carlo, using only a small random batch. There are theorems showing that stochastic gradient descent converges under certain conditions, so it's a reasonable method: instead of waiting a long time for the true gradient at each step, you can converge faster. Speed isn't the only reason, though. Researchers have also found that a small batch size can improve the performance of neural networks, which makes sense as well: the smaller the batch, the higher the variance of the gradient estimate, and that extra variance (i.e. noise) helps prevent the network from overfitting.
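Here is a minimal sketch of the idea, assuming a simple mean-squared-error loss for linear regression as an illustrative example (the data, batch size, and `gradient` helper below are all made up for demonstration). It shows how a mini-batch gradient computed from a few examples serves as a cheap, unbiased estimate of the full-batch gradient:

```python
import numpy as np

# Toy setup: MSE loss for linear regression (illustrative assumption).
rng = np.random.default_rng(0)
N = 100_000                       # full dataset size
X = rng.normal(size=(N, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=N)

def gradient(w, X_batch, y_batch):
    """Gradient of the loss (1/n) * sum_i (x_i . w - y_i)^2 w.r.t. w."""
    n = len(y_batch)
    residual = X_batch @ w - y_batch
    return (2.0 / n) * (X_batch.T @ residual)

w = np.zeros(10)

# Full-batch gradient: exact, but it touches all N examples.
full_grad = gradient(w, X, y)

# Mini-batch gradient: uses only `batch_size` random examples.
# Averaged over many random batches, it equals the full-batch gradient,
# which is what "unbiased Monte Carlo estimate" means here.
batch_size = 64
idx = rng.choice(N, size=batch_size, replace=False)
mini_grad = gradient(w, X[idx], y[idx])

print("full-batch gradient:", full_grad[:3])
print("mini-batch gradient:", mini_grad[:3])
```

Each SGD step uses such a mini-batch gradient in place of the full one, so a single update costs a small fraction of a full pass over the data.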

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange