I've read many posts on how PyTorch deals with non-differentiability in a network caused by non-differentiable (or only almost-everywhere differentiable, which isn't much better) activation functions during backprop. However, I haven't been able to put together a full picture of what exactly happens.

Most answers deal with ReLU, $\max(0, x)$, and claim that the derivative at $0$ is by convention taken to be either $0$ or $1$ (I'm not sure which).
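For what it's worth, a quick check (if I'm running it correctly) seems to suggest that for ReLU the gradient at exactly $0$ comes out as $0$, but I don't know whether this generalizes:

```python
import torch

# Evaluate ReLU exactly at its kink and ask autograd for the gradient there.
x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # prints tensor(0.) on my machine, i.e. the "left" derivative
```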

But there are many other activation functions with multiple points of non-differentiability.

[Image: plot of an activation function with 2 points of non-differentiability]

[Image: plot of an activation function with 4 points of non-differentiability]

How does PyTorch systematically deal with all of these points during backprop? Does anyone have an authoritative answer?


Solution

In practice, the input is almost never exactly equal to one of those points because of floating-point rounding. And even when it is, the backward implementations in torch return a one-sided (left or right) derivative at such points, which is defined in every case. So non-differentiability doesn't pose a problem here.
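As a minimal sketch of this (using a couple of standard activations as examples; the exact value picked at each kink is whatever that activation's backward kernel happens to return, not something I'm asserting here), you can probe the kink points directly:

```python
import torch
import torch.nn.functional as F

# Ask autograd for gradients exactly at the kink points of a few activations.
# Backward never fails there: each kernel returns a fixed, well-defined value,
# effectively picking one side of the kink.
for name, fn, points in [
    ("relu", torch.relu, [0.0]),
    ("abs", torch.abs, [0.0]),
    ("hardtanh", F.hardtanh, [-1.0, 1.0]),  # kinks at min_val=-1 and max_val=1
]:
    for p in points:
        x = torch.tensor(p, requires_grad=True)
        fn(x).backward()
        print(f"{name} grad at {p}: {x.grad.item()}")
```

Running this shows a concrete, finite gradient at every kink, which is the whole point: autograd just uses whichever one-sided value the operator's backward defines.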

Licensed under: CC-BY-SA with attribution