Question

It looks to me like the leaky ReLU should have much better performance, since the standard ReLU cannot use half of its input space (for x < 0 both the output and the gradient are zero). But this doesn't happen, and in practice most people use the standard ReLU.


Solution

One reason that ReLU units were introduced is to circumvent the vanishing-gradient problem of sigmoidal units, which saturate at their asymptotes (0 and 1 for the logistic sigmoid, -1 and 1 for tanh).

Another advantage of ReLU units is that they saturate at exactly 0, allowing for sparse representations, which can be helpful when the hidden units are used as input for a classifier. The downside of that zero region is the "dying ReLU" problem: a unit whose pre-activation is never positive (e.g. because of an unlucky initialization or a large negative update) receives zero gradient and can never recover in a gradient-based training scenario.
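To make the dying-ReLU point concrete, here is a minimal NumPy sketch (NumPy is an assumption; the answer names no framework) showing that a unit with an all-negative pre-activation produces zero output *and* zero gradient, so no update can flow through it:

```python
import numpy as np

def relu(x):
    """Standard ReLU: outputs 0 for all x < 0."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient of ReLU w.r.t. its input: 0 wherever x <= 0."""
    return (x > 0).astype(float)

# A pre-activation that is entirely negative: the unit is "dead".
# Both the forward output and the gradient are zero, so backprop
# delivers no signal to the weights feeding this unit.
x = np.array([-2.0, -0.5, -1.5])
print(relu(x))       # all zeros
print(relu_grad(x))  # all zeros -> no weight update possible
```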

This problem can be alleviated by using leaky ReLU units. On the other hand, leaky ReLU units lack the ability to create a hard-zero sparse representation, which can be useful in certain cases. So there is a bit of a trade-off and, as is usual with neural networks, which unit performs better depends on the use case. In most cases, if the initialization ensures that the ReLU units start out activated (e.g. by setting the biases to small positive values), one would expect ReLU and leaky ReLU units to perform very similarly.
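The trade-off above can be seen directly by comparing the two activations on the same input; this NumPy sketch (NumPy and the 0.01 slope are assumptions, 0.01 being a commonly used default) shows that ReLU produces exact zeros, while leaky ReLU never does:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha: fixed slope for x < 0 (0.01 is a common default,
    # not a value prescribed by the answer)
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
# ReLU yields exact zeros for x <= 0 (a hard-sparse code);
# leaky ReLU keeps small nonzero negative values, so the
# representation is never exactly sparse -- but every unit
# still receives a gradient.
print(relu(x))
print(leaky_relu(x))
```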

Also, the leaky ReLU (if parametric, i.e. PReLU) introduces another parameter (the slope for $x<0$) that needs to be learned during training and therefore adds complexity/training time.
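A minimal sketch of what that extra parameter costs: in a PReLU, the backward pass must also compute the gradient with respect to the learned slope `a` (names and shapes here are illustrative assumptions, not from the answer):

```python
import numpy as np

def prelu(x, a):
    # Parametric ReLU: the negative-side slope `a` is a learned
    # parameter rather than a fixed constant.
    return np.where(x > 0, x, a * x)

def prelu_grad_a(x):
    # Gradient of the output w.r.t. the slope a:
    # x on the negative side, 0 elsewhere. This is the extra
    # quantity every backward pass must track (per unit, or per
    # layer if the slope is shared).
    return np.where(x > 0, 0.0, x)

x = np.array([-2.0, 1.0, -0.5])
a = 0.25
print(prelu(x, a))       # negative inputs scaled by a
print(prelu_grad_a(x))   # nonzero only where x < 0
```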

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange