I've read many posts on how PyTorch deals with non-differentiability in a network caused by non-differentiable (or only almost-everywhere differentiable, which isn't much better) activation functions during backprop. However, I haven't been able to put together a full picture of what exactly happens.

Most answers deal with ReLU, $\max(0, x)$, and claim that the derivative at $0$ is by convention taken to be either $0$ or $1$ (I'm not sure which).
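For what it's worth, a quick check (if I'm running it correctly) seems to suggest that for ReLU the gradient at exactly $0$ comes out as $0$, but I don't know whether this generalizes:

```python
import torch

# Evaluate ReLU exactly at its kink and ask autograd for the gradient there.
x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)  # prints tensor(0.) on my machine, i.e. the "left" derivative
```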

But there are many other activation functions with multiple points of non-differentiability.

[Image: plot of an activation function with 2 points of non-differentiability]

[Image: plot of an activation function with 4 points of non-differentiability]

How does PyTorch systematically deal with all of these points during backprop? Does anyone have an authoritative answer?


Solution

In practice, the input is almost never exactly equal to one of those points because of floating-point rounding. And even when it is, the backward implementations in torch return a one-sided (left or right) derivative at such points, which is defined in every case. So non-differentiability doesn't pose a problem here.
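As a minimal sketch of this (using a couple of standard activations as examples; the exact value picked at each kink is whatever that activation's backward kernel happens to return, not something I'm asserting here), you can probe the kink points directly:

```python
import torch
import torch.nn.functional as F

# Ask autograd for gradients exactly at the kink points of a few activations.
# Backward never fails there: each kernel returns a fixed, well-defined value,
# effectively picking one side of the kink.
for name, fn, points in [
    ("relu", torch.relu, [0.0]),
    ("abs", torch.abs, [0.0]),
    ("hardtanh", F.hardtanh, [-1.0, 1.0]),  # kinks at min_val=-1 and max_val=1
]:
    for p in points:
        x = torch.tensor(p, requires_grad=True)
        fn(x).backward()
        print(f"{name} grad at {p}: {x.grad.item()}")
```

Running this shows a concrete, finite gradient at every kink, which is the whole point: autograd just uses whichever one-sided value the operator's backward defines.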

Licensed under: CC-BY-SA with attribution