Question

According to the Adam optimization update rule: $$m \leftarrow \beta_1 m + (1 - \beta_1)\nabla J(\theta)$$ $$v \leftarrow \beta_2 v + (1 - \beta_2)(\nabla J(\theta) \odot \nabla J(\theta))$$ $$\theta \leftarrow \theta - \alpha \frac{m}{\sqrt{v}}$$

From the equations, it is clear that $m$ accumulates the gradient for each $\theta$ with an exponential decay, and $v$ does something similar with the squared magnitude of the gradient. Then, when we update the parameters $\theta$, we divide the accumulated gradient by the square root of the accumulated magnitude, so that parameters whose gradients have historically been small receive relatively larger updates, and vice versa.
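For concreteness, here is a minimal NumPy sketch of the update exactly as written above (bias correction omitted to mirror the equations, and with the usual small $\epsilon$ added to the denominator, which the equations leave out):

```python
import numpy as np

def adam_step(theta, grad, m, v, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam step following the equations above.

    Bias correction is omitted to mirror the question; eps is the usual
    small constant the full algorithm adds to avoid division by zero.
    """
    m = beta1 * m + (1 - beta1) * grad              # exponential average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad       # exponential average of squared gradients
    theta = theta - alpha * m / (np.sqrt(v) + eps)  # per-coordinate scaled update
    return theta, m, v

# toy usage on a quadratic loss J(theta) = 0.5 * ||theta||^2, so grad = theta
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for _ in range(100):
    theta, m, v = adam_step(theta, theta.copy(), m, v)
```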

In gradient clipping, we do something similar by rescaling the gradient vector with respect to a threshold. My question is: why do we need gradient clipping to solve the problem of gradient explosion when we can use the Adam optimizer to do a controlled search of the space for the minimum?


Solution

Both have a different role and approach, so I would say the two are not directly comparable.

Gradient clipping has one simple task: rescale the gradient once it exceeds a certain threshold. Its job is done as soon as that is achieved, i.e. the gradient is brought back to a reasonable magnitude so it does not explode.
It has no responsibility for whether learning will converge to the best possible minimum of the loss.
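As a rough sketch of that task (assuming the common clip-by-global-norm variant; the threshold value is illustrative, not prescribed here):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale the whole gradient vector down if its L2 norm exceeds max_norm.

    grads is a list of gradient arrays; max_norm is an illustrative threshold.
    """
    total_norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # leave small gradients untouched
    return [g * scale for g in grads]
```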

Adam, on the other hand, is an optimizer. It came as an improvement over RMSProp, combining the benefits of both momentum and RMSProp (read this answer).
Adam is expected to help the learning converge towards the loss minimum when it is moving through a valley or plateau, by managing both the momentum and the coordinate-specific gradient scaling.


Coming back to your question

why do we need gradient clipping to solve the problem of gradient explosion when we can use the Adam optimizer to do a controlled search of the space for the minimum?

Let's say that, in the first iteration, the model accumulates a very large gradient because of its depth. It will then take a huge jump when updating the weights ($\theta$) and can land at essentially random coordinates in the loss surface.
Adam can certainly put a brake on the gradient, but it faces a few challenges:

  • It needs a few iterations to accumulate $v$ before it can apply that brake.
  • During those iterations, it is quite possible that the parameters jump to arbitrary coordinates in the space. This randomness also makes the gradients at successive steps essentially unrelated to one another, so Adam cannot accumulate any useful statistics and has no basis for acting in a corrective manner; everything happens haphazardly.


Also, keep in mind that exploding gradients are self-reinforcing: a large gradient produces a large weight update, which in turn produces an even larger gradient. So the learning very quickly reaches a NaN state.
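In practice the two are simply used together. A minimal PyTorch-style sketch (the model, data, and max_norm value are illustrative assumptions, not anything from the question), with the clip applied to the raw gradients before each Adam step:

```python
import torch
import torch.nn as nn

# Hypothetical model and data, purely for illustration.
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for x, y in [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(100)]:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Clip first, so a single huge gradient cannot throw the parameters to a
    # random point before Adam has accumulated any history in m and v.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```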

Licensed under: CC-BY-SA with attribution