Question

Which of the following is true, given the optimal learning rate?

(i) For convex loss functions (i.e. with a bowl shape), batch gradient descent is guaranteed to eventually converge to the global optimum while stochastic gradient descent is not.

(ii) For convex loss functions (i.e. with a bowl shape), stochastic gradient descent is guaranteed to eventually converge to the global optimum while batch gradient descent is not.

(iii) For convex loss functions (i.e. with a bowl shape), both stochastic gradient descent and batch gradient descent will eventually converge to the global optimum.

(iv) For convex loss functions (i.e. with a bowl shape), neither stochastic gradient descent nor batch gradient descent is guaranteed to converge to the global optimum.

Which option is correct and why?


Solution

(iii), if you add this proviso:

the learning rate is at or below the optimal value, and the training dataset is shuffled.

Why
When you compute the gradient over the full batch, it points toward the global minimum, so with a controlled learning rate you will reach it.
With stochastic GD, the individual gradients do not point toward the global minimum; each one points toward the minimum for the few records it was computed on, so the path looks zig-zag. For the same reason it may miss the exact minimum and bounce around it (see the sketch below).
In a theoretical worst case, if the dataset is sorted by class, the updates will move in the direction of one class and then the other, and will most likely miss the global minimum.
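
To make the contrast concrete, here is a minimal NumPy sketch (not from the original answer; the toy linear-regression data, the mse_gradient helper, the learning rate, and the iteration counts are illustrative assumptions). Batch GD settles essentially on the closed-form optimum, while constant-learning-rate SGD ends up close to it but keeps jittering:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy convex problem: linear regression with MSE loss.
# Illustrative data; the underlying parameters are roughly [4, 3].
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))
X_b = np.c_[np.ones((m, 1)), X]                 # prepend a bias column

def mse_gradient(theta, X_b, y):
    """Gradient of the MSE loss over whatever batch (X_b, y) is passed in."""
    return 2 / len(X_b) * X_b.T @ (X_b @ theta - y)

eta = 0.1                                       # assumed "small enough" constant learning rate

# Batch gradient descent: every step uses the full dataset, so every step
# points toward the global minimum of this convex loss.
theta_bgd = np.zeros((2, 1))
for _ in range(1000):
    theta_bgd -= eta * mse_gradient(theta_bgd, X_b, y)

# Stochastic gradient descent: each step uses a single instance, so the
# individual steps are noisy and the path zig-zags around the minimum.
theta_sgd = np.zeros((2, 1))
for epoch in range(50):
    order = rng.permutation(m)                  # reshuffle every epoch
    for i in order:
        theta_sgd -= eta * mse_gradient(theta_sgd, X_b[i:i + 1], y[i:i + 1])

theta_exact = np.linalg.lstsq(X_b, y, rcond=None)[0]   # closed-form optimum for comparison
print("closed form:", theta_exact.ravel())
print("batch GD:   ", theta_bgd.ravel())        # matches the closed form to several decimals
print("SGD:        ", theta_sgd.ravel())        # close, but still jittering around the optimum
```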


Reference excerpt from Hands-On Machine Learning

On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down (see Figure 4-9). So once the algorithm stops, the final parameter values are good, but not optimal.

When using Stochastic Gradient Descent, the training instances must be independent and identically distributed (IID) to ensure that the parameters get pulled toward the global optimum, on average. A simple way to ensure this is to shuffle the instances during training (e.g., pick each instance randomly, or shuffle the training set at the beginning of each epoch). If you do not shuffle the instances—for example, if the instances are sorted by label—then SGD will start by optimizing for one label, then the next, and so on, and it will not settle close to the global minimum.
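
The shuffling point can also be shown with a small sketch (again my own illustration, not code from the book; the data, the sgd_logistic helper, the learning rate, and the epoch count are all made up): a one-feature logistic regression trained with single-instance SGD, once on data left sorted by label and once with the order reshuffled every epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data, deliberately stored sorted by label:
# all class-0 instances first, then all class-1 instances.
n = 200
X = np.vstack([rng.normal(loc=-1.0, size=(n, 1)),
               rng.normal(loc=+1.0, size=(n, 1))])
y = np.concatenate([np.zeros(n), np.ones(n)])

def sgd_logistic(X, y, shuffle, eta=0.5, epochs=5, seed=1):
    """One-feature logistic regression trained with single-instance SGD."""
    order_rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = order_rng.permutation(len(X)) if shuffle else np.arange(len(X))
        for i in order:
            p = 1.0 / (1.0 + np.exp(-(w * X[i, 0] + b)))   # predicted P(class 1)
            grad = p - y[i]                                 # gradient of the log loss w.r.t. the logit
            w -= eta * grad * X[i, 0]
            b -= eta * grad
    return w, b

print("sorted by label:", sgd_logistic(X, y, shuffle=False))
print("shuffled epochs:", sgd_logistic(X, y, shuffle=True))
# In the sorted run, the second half of every epoch sees only class-1 instances,
# so the parameters it stops with tend to be pulled toward that class; the shuffled
# run keeps the updates pulling toward the global optimum on average.
```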

OTHER TIPS

Easy, if you know that stochastic gradient descent is a special case of batch gradient descent: then either both of them are guaranteed to converge or neither is. Since the option that neither of them converges does not hold, the answer can only be (iii), without knowing anything about why they actually must converge.

Licensed under: CC-BY-SA with attribution