Gradients for bias terms in backpropagation
22-10-2019
Question
I was trying to implement a neural network from scratch to understand the maths behind it. My problem is completely related to backpropagation (when we take the derivative with respect to the bias), and I derived all the equations used in backpropagation. Now every equation matches the code for the neural network except for the derivative with respect to the biases.
z1=x.dot(theta1)+b1                      # pre-activation, layer 1
h1=1/(1+np.exp(-z1))                     # sigmoid activation, layer 1
z2=h1.dot(theta2)+b2                     # pre-activation, layer 2
h2=1/(1+np.exp(-z2))                     # sigmoid activation, output layer
dh2=h2-y                                 # error at the output layer
#back prop
dz2=dh2*(1-dh2)
H1=np.transpose(h1)
dw2=np.dot(H1,dz2)                       # gradient w.r.t. theta2
db2=np.sum(dz2,axis=0,keepdims=True)     # gradient w.r.t. b2
I looked the code up online, and I want to know why we sum the matrix in db2=np.sum(dz2,axis=0,keepdims=True) and then subtract that scalar-like result from the original bias, rather than subtracting the matrix as a whole. Can anyone give me some intuition behind it? If I take the partial derivative of the loss with respect to the bias, it should give me only the upper gradient, which is dz2, because z2=h1.dot(theta2)+b2: the h1.dot(theta2) term differentiates to 0 and b2 to 1, so only the upper term is left.
b2 += -alpha * db2
Solution
The gradient for the bias term is very simple, which is why you often don't see it calculated explicitly. In fact,
db2 = dz2
So your update rules for bias on a single item are:
b2 += -alpha * dz2
and
b1 += -alpha * dz1
In terms of the maths, if your loss is $J$, and you know $\frac{\partial J}{\partial z_i}$ for a given neuron $i$ which has bias term $b_i$ . . .
$$\frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial z_i} \frac{\partial z_i}{\partial b_i}$$
and
$$\frac{\partial z_i}{\partial b_i} = 1$$
because $z_i = (\text{something unaffected by } b_i) + b_i$
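If you want to see this numerically, here is a minimal sanity check of my own (made-up example: a single sigmoid neuron with a squared-error loss and arbitrary values for w, x, b and y, none of which come from the question). Because $\frac{\partial z}{\partial b} = 1$, a finite-difference estimate of $\frac{\partial J}{\partial b}$ matches the analytic $\frac{\partial J}{\partial z}$:
import numpy as np

def loss(w, x, b, y):
    z = w * x + b                       # pre-activation
    h = 1.0 / (1.0 + np.exp(-z))        # sigmoid activation
    return 0.5 * (h - y) ** 2           # squared-error loss J

w, x, b, y, eps = 0.7, 1.5, -0.2, 1.0, 1e-6   # made-up values (assumption)

# numeric dJ/db via central differences
dJ_db = (loss(w, x, b + eps, y) - loss(w, x, b - eps, y)) / (2 * eps)

# analytic dJ/dz for this loss and activation: (h - y) * h * (1 - h)
z = w * x + b
h = 1.0 / (1.0 + np.exp(-z))
dJ_dz = (h - y) * h * (1 - h)

print(dJ_db, dJ_dz)    # the two values agree up to numerical error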
It looks like the code you copied uses the form
db2=np.sum(dz2,axis=0,keepdims=True)
because the network is designed to process examples in (mini-)batches, and you therefore have gradients calculated for more than one example at a time. The sum is squashing the results down to a single update. This would be easier to confirm if you also showed update code for weights.
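For illustration, here is a rough sketch of what the full mini-batch update could look like, using the question's variable names; the shapes, the random data and the learning rate alpha are my own assumptions, not the asker's actual code:
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
batch, hidden, out = 4, 5, 3             # assumed sizes for illustration

h1 = rng.random((batch, hidden))         # activations, one row per example
dz2 = rng.random((batch, out))           # upstream gradient, one row per example
theta2 = rng.random((hidden, out))
b2 = np.zeros((1, out))

dw2 = np.dot(h1.T, dz2)                      # (hidden, out): the dot product already sums over the batch
db2 = np.sum(dz2, axis=0, keepdims=True)     # (1, out): explicit sum over the batch for the bias

theta2 += -alpha * dw2    # one update per batch,
b2     += -alpha * db2    # with shapes matching the parameters
Both gradients collapse the batch dimension: np.dot does it implicitly for dw2, while np.sum does it explicitly for db2.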
OTHER TIPS
I would like to explain the meaning of db2=np.sum(dz2,axis=0,keepdims=True), as it also confused me once and it didn't get answered.
The derivative of L (the loss) w.r.t. b is the upstream derivative multiplied by the local derivative:
$$
\frac{ \partial L}{\partial \mathbf{b}} = \frac{ \partial L}{\partial Z} \frac{ \partial Z}{\partial \mathbf{b}}
$$
If we have multiple samples, Z (and hence $\frac{\partial L}{\partial Z}$) is a matrix, while b is still a vector.
The local derivative is simply a vector of ones: $$ \frac{\partial Z}{\partial \mathbf{b}} = \frac{\partial}{\partial \mathbf{b}} \left( W \times X + \mathbf{b} \right) = \mathbf{1} $$
That means our complete derivative is a matrix multiplication, which looks as follows (e.g. 2 samples with 3 outputs): $$ \mathbf{1}^\top \times \frac{\partial L}{\partial Z} = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} . &. &. \\ . &. &. \end{bmatrix} $$
Note that this is the sum of the rows, i.e. the sum over the samples.
And that's where db2=np.sum(dz2,axis=0,keepdims=True)
comes from. It is simply an abbreviation for the matrix multiplication of the local and the upstream derivatives.
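As a quick check of that equivalence (with a made-up 2x3 dz2 of my own), left-multiplying by a row of ones gives exactly the same result as the axis=0 sum:
import numpy as np

dz2 = np.array([[0.1, -0.2,  0.3],
                [0.4,  0.5, -0.6]])          # 2 samples, 3 outputs (illustrative values)

ones_row = np.ones((1, dz2.shape[0]))        # shape (1, 2)
via_matmul = ones_row @ dz2                  # shape (1, 3): sum over the samples
via_sum = np.sum(dz2, axis=0, keepdims=True) # shape (1, 3)

print(np.allclose(via_matmul, via_sum))      # True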