Question

I'm a beginner in ML. In an ANN, ReLU has a gradient of 1 for x > 0.

However, for x <= 0, ReLU has a gradient of 0, so I wonder whether it can cause the vanishing gradient problem in deep neural networks.

If an activation function like y = x (for all x) has no vanishing gradient problem, why don't we use it in deep neural networks? Is there any side effect of y = x (for all x)? (Maybe the weights could go to infinity in deep neural networks... but I think that can also happen with ReLU, so I don't think that is the problem.)


Solution

If you use an activation like y = x, your model is a purely linear one. Stacking multiple layers with such an activation is equivalent to (reduces to) a single layer with a linear activation, so this kind of model can only fit linear functions well. To learn complex non-linear functions, you need multiple layers with non-linear activations in between, which makes the whole model non-linear.
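A minimal sketch of this collapse, using NumPy with illustrative weight names and shapes (not from the question): two stacked layers whose activation is the identity y = x produce exactly the same output as one combined linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                          # input vector
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=(5,))
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=(3,))

# Two stacked layers with the identity activation y = x
h = W1 @ x + b1                                    # the "activation" changes nothing
out_two_layers = W2 @ h + b2

# Equivalent single linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
out_one_layer = W_combined @ x + b_combined

print(np.allclose(out_two_layers, out_one_layer))  # True
```

No matter how many such layers you stack, the composition is still one affine map, so depth adds no expressive power.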

To prevent the zero gradient in the negative region, there is a variant of ReLU called Leaky ReLU. It is the same as ReLU for x > 0; for x <= 0 it is a linear function with a small slope (e.g. 0.2), so the gradient there is small but non-zero. The kink at x = 0 keeps Leaky ReLU a non-linear activation.
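A minimal sketch of Leaky ReLU and its gradient, assuming the 0.2 slope used as the example above (libraries often default to smaller values such as 0.01):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    # Same as ReLU for x > 0; a small linear slope for x <= 0.
    return np.where(x > 0, x, negative_slope * x)

def leaky_relu_grad(x, negative_slope=0.2):
    # Gradient is 1 for x > 0 and negative_slope (not 0) for x <= 0,
    # so gradients in the negative region do not vanish completely.
    return np.where(x > 0, 1.0, negative_slope)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))       # [-0.4 -0.1  0.   0.5  2. ]
print(leaky_relu_grad(x))  # [0.2 0.2 0.2 1.  1. ]
```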

Licensed under: CC-BY-SA with attribution