You're correct that if you use a step function for your activation function g, the gradient is always zero (except at 0), so the delta rule (aka gradient descent) just does nothing (dw = 0). This is why a step-function perceptron doesn't work well with gradient descent. :)
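To make that concrete, here's a minimal sketch (Python/NumPy, with made-up inputs and weights) of why the update vanishes: the delta rule multiplies the error by g'(h), and for a step function that derivative is 0 everywhere except at h = 0.

```python
import numpy as np

# Hypothetical values for illustration: one neuron, step activation.
x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.1, 0.4, -0.2])   # weights
t, alpha = 1.0, 0.1              # target and learning rate

h = w @ x                        # net input
y = 1.0 if h >= 0 else 0.0       # step activation g(h)
g_prime = 0.0                    # derivative of the step function (almost everywhere)

dw = alpha * (t - y) * g_prime * x
print(dw)                        # all zeros -- gradient descent does nothing
```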
For a linear perceptron, you'd have g'(x) = 1, so dw = alpha * (t_i - y_i) * x_i.
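As an illustration, here's a small sketch (Python/NumPy, with toy data I've made up) of that linear-unit update applied over a few epochs:

```python
import numpy as np

# Toy data (assumed for illustration): targets follow t = 2*x1 - x2.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
T = np.array([2.0, -1.0, 1.0, 3.0])

w = np.zeros(2)
alpha = 0.1

for epoch in range(100):
    for x_i, t_i in zip(X, T):
        y_i = w @ x_i                   # linear activation: g(h) = h, so g'(h) = 1
        w += alpha * (t_i - y_i) * x_i  # dw = alpha * (t_i - y_i) * x_i

print(w)                                # approaches [2., -1.]
```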
You've seen code that uses dw = alpha * (t_i - y_i) * h_j * x_i. We can reverse-engineer what's going on here: the update multiplies the error by the neuron's own output (presumably that's what h_j is) in the spot where the delta rule calls for g' evaluated at the net input, so we need g' = g. Remembering our calculus, the functions satisfying g' = g are g(x) = C * e^x (the constant of integration shows up as a scale factor, not an additive term). So apparently the code sample you found uses an exponential activation function.
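Here's a small sketch (Python/NumPy, with assumed toy values) of that reading: with g(x) = e^x, the derivative factor in the delta rule is just the neuron's output, which reproduces the update from the code you found.

```python
import numpy as np

# Illustrative values (made up); single neuron with exponential activation.
x = np.array([0.2, -0.4, 1.0])   # inputs
w = np.array([0.3, 0.1, -0.5])   # weights
t, alpha = 0.8, 0.05             # target and learning rate

net = w @ x
y = np.exp(net)                  # output: g(net) = e^net

# Delta rule: dw = alpha * (t - y) * g'(net) * x.
# With g(x) = e^x we have g'(net) = e^net = y, so the derivative
# factor is just the neuron's output -- matching the code you found.
dw = alpha * (t - y) * y * x
print(dw)
```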
This must mean that the neuron outputs are constrained to be on (0, infinity) (or, if you shifted the activation to g(x) = e^x + a, on (a, infinity) for any finite a; but note that then g'(x) = g(x) - a, so the update would have to use h_j - a rather than h_j). I haven't run into this before, but I see some references online. Logistic or tanh activations are more common for bounded outputs (either classification or regression with known bounds).
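For comparison, here's a minimal sketch (Python/NumPy, toy values assumed) of the same delta-rule update with a logistic activation, whose derivative g'(net) = y * (1 - y) keeps the output bounded in (0, 1):

```python
import numpy as np

# Illustrative values (made up); single neuron with logistic activation.
x = np.array([0.2, -0.4, 1.0])   # inputs
w = np.array([0.3, 0.1, -0.5])   # weights
t, alpha = 1.0, 0.1              # target and learning rate

net = w @ x
y = 1.0 / (1.0 + np.exp(-net))   # logistic output, always in (0, 1)

# For the logistic function, g'(net) = y * (1 - y).
dw = alpha * (t - y) * y * (1.0 - y) * x
w = w + dw
print(w)
```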