Question

I've read many posts on how PyTorch deals with non-differentiability in a network during backprop, caused by activation functions that are non-differentiable (or only almost-everywhere differentiable, which isn't much better). However, I have not been able to form a full picture of what exactly happens.

Most answers deal with ReLU, $\max(0,x)$, and claim that the derivative at $0$ is taken to be either $0$ or $1$ by convention (I'm not sure which one).
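A minimal way to probe this empirically is sketched below; it only reveals the behavior of the installed build, not an official convention:

```python
import torch

# Evaluate the ReLU gradient exactly at the kink x = 0 and print it.
x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # tensor(0.) here, i.e. this build treats the derivative at 0 as 0
```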

But there are many other activation functions with multiple points of non-differentiability.

[plot of an activation function with 2 points of non-differentiability]

[plot of an activation function with 4 points of non-differentiability]

How does PyTorch systematically deal with all of these points during backprop? Does anyone have an authoritative answer?


Solution

In practice, the input is almost never exactly equal to those points because of floating-point rounding. And even when it is, the PyTorch implementations of these functions compute a one-sided (left or right) derivative, which is defined in every case. So non-differentiability doesn't pose a problem here.
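As a small sketch of this, the snippet below evaluates an activation with two kinks (hardtanh, non-differentiable at $-1$ and $1$) exactly at those points; the printed gradients show whichever one-sided/convention derivative the backward kernel implements, and they are always finite, never NaN:

```python
import torch
import torch.nn.functional as F

# Probe gradients of hardtanh exactly at its two kinks (-1 and 1).
x = torch.tensor([-1.0, 0.0, 1.0], requires_grad=True)
F.hardtanh(x).sum().backward()
print(x.grad)  # well-defined values at every point, including the kinks
```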

Licensed under: CC-BY-SA with attribution