Question

In full-batch gradient descent or minibatch GD, we compute a gradient from several training examples, then average these estimates to obtain a "high-quality" gradient and finally use it to correct the network in a single update.

But why does averaging the gathered gradients work? Each training sample sits at a distant, completely separate location on the error surface.

Each sample would thus have its direction of steepest descent pointing somewhere different from the other training examples, so averaging those directions should not make sense. Yet it works so well; in fact, the more examples we average, the more precise the correction becomes (full-batch vs. mini-batch approach).
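One way to see why the average is meaningful: the total loss is the mean of per-example losses, and differentiation is linear, so the average of per-example gradients is exactly the gradient of the average loss. A minimal NumPy sketch for a toy linear model (all names and data here are hypothetical, chosen only for illustration):

```python
import numpy as np

# Toy setup: linear model y_hat = X @ w with squared-error loss per example.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 training examples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)

# Gradient of 0.5 * (x @ w - y_i)**2 w.r.t. w, computed one example at a time.
per_example_grads = np.array([(x @ w - yi) * x for x, yi in zip(X, y)])

# Averaging those per-example gradients reproduces the full-batch gradient
# of the mean loss, by linearity of the derivative.
avg_grad = per_example_grads.mean(axis=0)
full_batch_grad = X.T @ (X @ w - y) / len(y)

print(np.allclose(avg_grad, full_batch_grad))  # True
```

So the per-example gradients are not arbitrary directions being blended together: each one is a noisy but unbiased estimate of the same full-batch gradient, and averaging more of them reduces the variance of that estimate.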


No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange