Question

I am using TensorFlow to write simple neural networks for a bit of research, and I have had many problems with 'nan' weights while training. I tried many different solutions, like changing the optimizer, changing the loss, the data size, etc., but to no avail. Finally, I noticed that a change in the learning rate made an unbelievable difference in my weights.

Using a learning rate of .001 (which I thought was pretty conservative), the minimize function would actually raise the loss exponentially. After one epoch the loss could jump from a number in the thousands to a trillion and then to infinity ('nan'). When I lowered the learning rate to .0001, everything worked fine.
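For reference, a minimal sketch of the kind of setup I mean; the model, data, and optimizer below are just placeholders (using the Keras API for brevity), not my actual code:

```python
import numpy as np
import tensorflow as tf

# Toy regression data standing in for my dataset.
x = np.random.randn(256, 10).astype("float32")
y = np.random.randn(256, 1).astype("float32")

# Placeholder network; the only point here is where the learning rate goes.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

# learning_rate=1e-3 blew up to 'nan' on my problem; 1e-4 trained normally.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4), loss="mse")
model.fit(x, y, epochs=5, verbose=0)
```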

1) Why does a single order of magnitude have such an effect?

2) Why does the minimize function literally perform the opposite of its function and maximize the loss? It seems to me that that shouldn't occur, no matter the learning rate.


Solution

You might find Chapter 8 of Deep Learning helpful. In it, the authors discuss training of neural network models. It's very intricate, so I'm not surprised you're having difficulties.

One possibility (besides user error) is that your problem is highly ill-conditioned. Gradient descent methods use only the first derivative (gradient) information when computing an update. This can cause problems when the second derivative (the Hessian) is ill-conditioned.

Quoting from the authors:

Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix $H$. This is a very general problem in most numerical optimization, convex or otherwise, and is described in more detail in section 4.3.1.

The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function. [my emphasis added]

The authors provide a simple derivation to show that this can be the case. Using gradient descent, the cost function should change (to second order) by

\begin{equation} \frac{\varepsilon^2}{2} g^{T} H g - \varepsilon g^{T} g \end{equation}

where $g$ is the gradient, $H$ is the Hessian, and $\varepsilon$ is the learning rate. Clearly, if the second derivatives are large, then the first term can swamp the second, and the cost function will increase, not decrease. Since the first and second terms scale differently with $\varepsilon$, one way to alleviate this problem is to reduce $\varepsilon$ (although, of course, this can result in learning too slowly).
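To make this concrete, here is a small numerical sketch of the expression above on a deliberately ill-conditioned quadratic; the Hessian, starting point, and step sizes are made up for illustration and are not taken from your problem:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T H x (condition number 1e4),
# chosen only to illustrate the second-order expression above.
H = np.diag([1.0, 1.0e4])
x = np.array([1.0, 1.0])
g = H @ x                    # gradient of f at x

def f(x):
    return 0.5 * x @ H @ x

for eps in (1e-3, 1e-4):
    predicted = 0.5 * eps**2 * (g @ H @ g) - eps * (g @ g)  # the expression above
    actual = f(x - eps * g) - f(x)                          # exact for a quadratic
    print(f"lr={eps:g}: predicted change {predicted:+.1f}, actual {actual:+.1f}")

# lr=1e-3 increases the cost (the eps^2 term dominates along the stiff
# direction), while lr=1e-4 decreases it -- the same flip described in the question.
```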

OTHER TIPS

1) Why does a single order of magnitude have such an effect?

2) Why does the minimize function literally perform the opposite of its function and maximize the loss? It seems to me that that shouldn't occur, no matter the learning rate.

There are two main reasons. The first is that you are not using the same data in the first step as in the second (each step sees a different batch). If in the first step the model fits those values and falls into a local minimum, then it is very likely to give a bigger loss for new values.

The second reason is the shape of the cost function. You try to minimize the value in small steps, and the length of each step is determined by two factors: the gradient and the learning rate. Imagine your function is like x^2. If your values are close to 0, the gradient is smaller than when they are further away, but if your learning rate is big then, instead of getting closer to 0, you actually increase the error, because your new point, based on the gradient and the learning rate, is further from 0 than your previous one. And this can happen several times.

Take a look at this link: http://www.statisticsviews.com/details/feature/5722691/Getting-to-the-Bottom-of-Regression-with-Gradient-Descent.html

If you look at the figures with alpha 0.01 and alpha 0.12, you will see that in the first figure the learning rate is small, so the iterate gets closer to the minimum, while in the second case the learning rate is so big that every step moves further away.
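To see the same overshoot numerically, here is a tiny sketch of gradient descent on x^2; the learning rates 0.1 and 1.1 are chosen only to make the effect visible on this particular parabola and are not the alphas from the article:

```python
# Gradient descent on f(x) = x^2, whose gradient is 2x.
def gradient_descent(lr, x0=5.0, steps=5):
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x = x - lr * 2 * x        # update: x <- x - lr * f'(x)
        trajectory.append(x)
    return trajectory

print(gradient_descent(lr=0.1))   # shrinks toward 0: 5.0, 4.0, 3.2, 2.56, ...
print(gradient_descent(lr=1.1))   # overshoots past 0 and grows: 5.0, -6.0, 7.2, -8.64, ...
```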

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange