Question

In full-batch gradient descent or minibatch GD, we compute a gradient from several training examples, then average these estimates to obtain a "high-quality" gradient and finally use it to correct the network in a single update.

But why does averaging the gathered gradients work? Each training sample sits at a distant, completely separate location on the error surface.

Each sample would thus have its direction of steepest descent pointing somewhere different from the other training examples, so averaging those directions should not make sense. Yet it works so well; in fact, the more examples we average, the more precise the correction becomes (full-batch vs. mini-batch approach).
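One way to see why the average is meaningful: the total loss is the mean of per-example losses, and differentiation is linear, so the average of per-example gradients is exactly the gradient of the average loss. A minimal NumPy sketch for a toy linear model (all names and data here are hypothetical, chosen only for illustration):

```python
import numpy as np

# Toy setup: linear model y_hat = X @ w with squared-error loss per example.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 training examples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)

# Gradient of 0.5 * (x @ w - y_i)**2 w.r.t. w, computed one example at a time.
per_example_grads = np.array([(x @ w - yi) * x for x, yi in zip(X, y)])

# Averaging those per-example gradients reproduces the full-batch gradient
# of the mean loss, by linearity of the derivative.
avg_grad = per_example_grads.mean(axis=0)
full_batch_grad = X.T @ (X @ w - y) / len(y)

print(np.allclose(avg_grad, full_batch_grad))  # True
```

So the per-example gradients are not arbitrary directions being blended together: each one is a noisy but unbiased estimate of the same full-batch gradient, and averaging more of them reduces the variance of that estimate.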


No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange