Question

When I read about activation functions, I read that the reason we don't use the step function is that it is non-differentiable, which causes problems for gradient descent.
I am a beginner in deep learning. Since ReLU is almost linear and is also non-differentiable where it touches the x-axis, why does it perform so much better than the tanh or sigmoid functions, and why is it so widely used in deep learning?
Since it is non-differentiable, doesn't that affect training?

The solution

A step function is discontinuous, and its first derivative is a Dirac delta function. The discontinuity is one problem for gradient descent, but the bigger issue is that the slope is zero everywhere else, so gradient descent gets no signal about which direction to move when minimizing the loss. The function is essentially saturated for all values greater than and less than zero.
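To make that concrete, here is a minimal NumPy sketch (the helper names step and numerical_derivative are mine, not from any library) showing that a finite-difference estimate of the step function's slope is zero everywhere away from the jump, so gradient descent has nothing to follow:

    import numpy as np

    def step(x):
        # Heaviside step: 0 for x < 0, 1 for x >= 0
        return (x >= 0).astype(float)

    def numerical_derivative(f, x, eps=1e-4):
        # Central finite-difference estimate of f'(x)
        return (f(x + eps) - f(x - eps)) / (2 * eps)

    x = np.array([-5.0, -1.0, -0.1, 0.1, 1.0, 5.0])
    print(numerical_derivative(step, x))  # [0. 0. 0. 0. 0. 0.] -- no gradient anywhere away from 0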

By contrast, ReLU is continuous and only its first derivative is a discontinuous step function. Since the ReLU function is continuous and well defined, gradient descent is well behaved and leads to a well behaved minimization. Further, ReLU does not saturate for large values greater than zero. This is in contrast to sigmoid or tanh, which tend to saturate for large values. ReLU maintains a constant slope of 1 as x moves toward infinity.
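The saturation difference is easy to see by evaluating the derivatives directly. A short sketch (the gradient helpers below are written out by hand for illustration, not taken from any framework):

    import numpy as np

    def relu_grad(x):
        # ReLU derivative: 1 for x > 0, 0 for x < 0 (undefined at exactly 0; frameworks simply pick 0 or 1 there)
        return (x > 0).astype(float)

    def sigmoid_grad(x):
        s = 1.0 / (1.0 + np.exp(-x))
        return s * (1.0 - s)          # sigmoid derivative

    def tanh_grad(x):
        return 1.0 - np.tanh(x) ** 2  # tanh derivative

    x = np.array([0.5, 2.0, 5.0, 10.0])
    print(relu_grad(x))     # [1. 1. 1. 1.]                       -- slope stays 1 for all positive x
    print(sigmoid_grad(x))  # roughly [0.24 0.10 0.0066 4.5e-05]  -- shrinks toward 0
    print(tanh_grad(x))     # roughly [0.79 0.071 1.8e-04 8.2e-09] -- shrinks even faster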

The issue with saturation is that the gradient becomes tiny in the saturated regions, so gradient descent methods take a long time to find the minimum for a saturated function.
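One way to see why this slows training: by the chain rule, the gradient that reaches the early layers is a product of per-layer derivatives, so saturated activations shrink it exponentially with depth. A rough back-of-the-envelope sketch, assuming a pre-activation value of about 2 at every layer of a 20-layer network:

    import numpy as np

    depth = 20
    x = 2.0  # assumed typical pre-activation at each layer

    tanh_factor = 1.0 - np.tanh(x) ** 2   # ~0.07 per tanh layer
    relu_factor = 1.0 if x > 0 else 0.0   # 1 per ReLU layer on the active side

    print(tanh_factor ** depth)  # ~1e-23: the gradient has effectively vanished
    print(relu_factor ** depth)  # 1.0: ReLU passes the gradient through unchanged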

Summarizing:

  • Step function: discontinuous, and its slope is zero (saturated) everywhere away from the jump.
  • Tanh: continuous and well defined, but saturates for large positive and negative values.
  • Sigmoid: continuous and well defined, but saturates for large positive and negative values.
  • ReLU: continuous and well defined, and does not saturate for large positive values.

Hope this helps!

Licensed under: CC-BY-SA with attribution