Gradients for bias terms in backpropagation
22-10-2019
Question
I was trying to implement a neural network from scratch to understand the maths behind it. My problem is completely related to backpropagation (when we take the derivative with respect to the bias), and I derived all the equations used in backpropagation. Now every equation matches the code for the neural network except for the derivative with respect to the biases.
z1=x.dot(theta1)+b1                      # pre-activation, layer 1
h1=1/(1+np.exp(-z1))                     # sigmoid activation, layer 1
z2=h1.dot(theta2)+b2                     # pre-activation, layer 2
h2=1/(1+np.exp(-z2))                     # sigmoid activation, output layer
dh2=h2-y                                 # error at the output layer
#back prop
dz2=dh2*(1-dh2)
H1=np.transpose(h1)
dw2=np.dot(H1,dz2)                       # gradient w.r.t. theta2
db2=np.sum(dz2,axis=0,keepdims=True)     # gradient w.r.t. b2
I looked the code up online, and I want to know why we sum the matrix in db2=np.sum(dz2,axis=0,keepdims=True) and then subtract that scalar-like result from the original bias, rather than subtracting the matrix as a whole. Can anyone give me some intuition behind it? If I take the partial derivative of the loss with respect to the bias, it should give me only the upper gradient, which is dz2, because z2=h1.dot(theta2)+b2: the h1.dot(theta2) term differentiates to 0 and b2 to 1, so only the upper term is left.
b2 += -alpha * db2
Solution
The gradient for the bias term is very simple, which is why you often don't see it calculated explicitly. In fact,
db2 = dz2
So your update rules for bias on a single item are:
b2 += -alpha * dz2
and
b1 += -alpha * dz1
In terms of the maths, if your loss is $J$, and you know $\frac{\partial J}{\partial z_i}$ for a given neuron $i$ which has bias term $b_i$ . . .
$$\frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial z_i} \frac{\partial z_i}{\partial b_i}$$
and
$$\frac{\partial z_i}{\partial b_i} = 1$$
because $z_i = (\text{something unaffected by } b_i) + b_i$
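If you want to see this numerically, here is a minimal sanity check of my own (made-up example: a single sigmoid neuron with a squared-error loss and arbitrary values for w, x, b and y, none of which come from the question). Because $\frac{\partial z}{\partial b} = 1$, a finite-difference estimate of $\frac{\partial J}{\partial b}$ matches the analytic $\frac{\partial J}{\partial z}$:
import numpy as np

def loss(w, x, b, y):
    z = w * x + b                       # pre-activation
    h = 1.0 / (1.0 + np.exp(-z))        # sigmoid activation
    return 0.5 * (h - y) ** 2           # squared-error loss J

w, x, b, y, eps = 0.7, 1.5, -0.2, 1.0, 1e-6   # made-up values (assumption)

# numeric dJ/db via central differences
dJ_db = (loss(w, x, b + eps, y) - loss(w, x, b - eps, y)) / (2 * eps)

# analytic dJ/dz for this loss and activation: (h - y) * h * (1 - h)
z = w * x + b
h = 1.0 / (1.0 + np.exp(-z))
dJ_dz = (h - y) * h * (1 - h)

print(dJ_db, dJ_dz)    # the two values agree up to numerical error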
It looks like the code you copied uses the form
db2=np.sum(dz2,axis=0,keepdims=True)
because the network is designed to process examples in (mini-)batches, and you therefore have gradients calculated for more than one example at a time. The sum is squashing the results down to a single update. This would be easier to confirm if you also showed update code for weights.
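For illustration, here is a rough sketch of what the full mini-batch update could look like, using the question's variable names; the shapes, the random data and the learning rate alpha are my own assumptions, not the asker's actual code:
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1
batch, hidden, out = 4, 5, 3             # assumed sizes for illustration

h1 = rng.random((batch, hidden))         # activations, one row per example
dz2 = rng.random((batch, out))           # upstream gradient, one row per example
theta2 = rng.random((hidden, out))
b2 = np.zeros((1, out))

dw2 = np.dot(h1.T, dz2)                      # (hidden, out): the dot product already sums over the batch
db2 = np.sum(dz2, axis=0, keepdims=True)     # (1, out): explicit sum over the batch for the bias

theta2 += -alpha * dw2    # one update per batch,
b2     += -alpha * db2    # with shapes matching the parameters
Both gradients collapse the batch dimension: np.dot does it implicitly for dw2, while np.sum does it explicitly for db2.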
OTHER TIPS
I would like to explain the meaning of db2=np.sum(dz2,axis=0,keepdims=True), as it also confused me once and it didn't get answered.
The derivative of L (the loss) w.r.t. b is the upstream derivative multiplied by the local derivative:
$$
\frac{ \partial L}{\partial \mathbf{b}} = \frac{ \partial L}{\partial Z} \frac{ \partial Z}{\partial \mathbf{b}}
$$
If we have multiple samples, Z (and hence $\frac{\partial L}{\partial Z}$) is a matrix, while b is still a vector.
The local derivative is simply a vector of ones: $$ \frac{\partial Z}{\partial \mathbf{b}} = \frac{\partial}{\partial \mathbf{b}} \left( W \times X + \mathbf{b} \right) = \mathbf{1} $$
That means our complete derivative is a matrix multiplication, which looks as follows (e.g. 2 samples with 3 outputs): $$ \mathbf{1}^\top \times \frac{\partial L}{\partial Z} = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} . &. &. \\ . &. &. \end{bmatrix} $$
Note that this is the sum of the rows, i.e. the sum over the samples.
And that's where db2=np.sum(dz2,axis=0,keepdims=True)
comes from. It is simply an abbreviation for the matrix multiplication of the local and the upstream derivatives.
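As a quick check of that equivalence (with a made-up 2x3 dz2 of my own), left-multiplying by a row of ones gives exactly the same result as the axis=0 sum:
import numpy as np

dz2 = np.array([[0.1, -0.2,  0.3],
                [0.4,  0.5, -0.6]])          # 2 samples, 3 outputs (illustrative values)

ones_row = np.ones((1, dz2.shape[0]))        # shape (1, 2)
via_matmul = ones_row @ dz2                  # shape (1, 3): sum over the samples
via_sum = np.sum(dz2, axis=0, keepdims=True) # shape (1, 3)

print(np.allclose(via_matmul, via_sum))      # True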