
I was trying to implement a neural network from scratch to understand the maths behind it. My problem is with backpropagation, specifically the derivative with respect to the bias. I derived all the equations used in backpropagation, and every one of them matches the code for the neural network except for the derivative with respect to the biases.


#back prop


I looked online for the code, and I want to know why we sum the matrix and then subtract the result, db2=np.sum(dz2,axis=0,keepdims=True), from the original bias. Why isn't the matrix as a whole subtracted? Can anyone give me some intuition behind it? If I take the partial derivative of the loss with respect to the bias, I get only the upstream gradient, dz2, because the derivative of the h1 and theta term with respect to b2 is 0 and the derivative of b2 itself is 1. So only the upstream term is left.



The bias term is very simple, which is why you often don't see it calculated. In fact

db2 = dz2

So your update rules for bias on a single item are:

b2 += -alpha * dz2


b1 += -alpha * dz1

In terms of the maths, if your loss is $J$, and you know $\frac{\partial J}{\partial z_i}$ for a given neuron $i$ which has bias term $b_i$ . . .

$$\frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial z_i} \frac{\partial z_i}{\partial b_i}$$


$$\frac{\partial z_i}{\partial b_i} = 1$$

because $z_i = (\text{something unaffected by } b_i) + b_i$
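To make the "something unaffected by $b_i$" concrete, assume a standard dense layer whose neuron $i$ receives inputs $h_j$ through weights $w_{ij}$ (these symbols are assumptions for illustration, not names taken from the original code):

$$z_i = \sum_j w_{ij} h_j + b_i \quad\Rightarrow\quad \frac{\partial z_i}{\partial b_i} = 0 + 1 = 1 \quad\Rightarrow\quad \frac{\partial J}{\partial b_i} = \frac{\partial J}{\partial z_i}$$

The weighted sum does not contain $b_i$, so it contributes nothing to the derivative, and the bias gradient for a single example is just the upstream gradient.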

It looks like the code you copied uses the form

db2=np.sum(dz2,axis=0,keepdims=True)

because the network is designed to process examples in (mini-)batches, so gradients are calculated for more than one example at a time. The sum squashes the per-example results down to a single update. This would be easier to confirm if you also showed the update code for the weights.
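The batched form db2 = np.sum(dz2, axis=0, keepdims=True) can be checked numerically. The sketch below is not the code from the question: it assumes a layer z2 = h1 @ W2 + b2 with one example per row, a summed squared-error loss, and made-up shapes and names (`n_samples`, `loss`, `alpha`). It compares the summed gradient with a finite-difference estimate and shows why the full matrix cannot be subtracted from the bias directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_hidden, n_out = 5, 4, 3   # assumed sizes, for illustration only

h1 = rng.normal(size=(n_samples, n_hidden))  # hidden activations, one row per example
W2 = rng.normal(size=(n_hidden, n_out))      # output-layer weights
b2 = np.zeros((1, n_out))                    # one bias per output neuron
y = rng.normal(size=(n_samples, n_out))      # targets

def loss(b):
    # The single bias row is broadcast onto every example in the batch.
    z2 = h1 @ W2 + b
    return 0.5 * np.sum((z2 - y) ** 2)

# Upstream gradient for this assumed loss: one row per example.
dz2 = (h1 @ W2 + b2) - y                     # shape (5, 3)

# Every example touches the same bias, so its gradient is the sum over examples.
db2 = np.sum(dz2, axis=0, keepdims=True)     # shape (1, 3), same as b2

# Central finite differences on each bias entry as an independent check.
eps = 1e-6
db2_numeric = np.zeros_like(b2)
for j in range(n_out):
    step = np.zeros_like(b2)
    step[0, j] = eps
    db2_numeric[0, j] = (loss(b2 + step) - loss(b2 - step)) / (2 * eps)

batch_grad_matches = np.allclose(db2, db2_numeric, atol=1e-5)

# The update keeps the bias shape; dz2 itself has shape (5, 3) and would not fit.
alpha = 0.1
b2_new = b2 - alpha * db2
print(dz2.shape, db2.shape, b2_new.shape, batch_grad_matches)
```

If the loss is averaged over the batch instead of summed, the same reasoning applies with an extra factor of 1 / n_samples.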


I would like to explain the meaning of db2=np.sum(dz2,axis=0,keepdims=True), as it once confused me too and was left unanswered.

The derivative of L (the loss) w.r.t. b is the upstream derivative multiplied by the local derivative: $$ \frac{ \partial L}{\partial \mathbf{b}} = \frac{ \partial L}{\partial Z} \frac{ \partial Z}{\partial \mathbf{b}} $$

If we have multiple samples, Z and the upstream derivative $\frac{\partial L}{\partial Z}$ are both matrices, while b is still a vector.

The local derivative is simply a vector of ones: $$ \frac{ \partial Z}{\partial \mathbf{b}} = \frac{\partial}{\partial \mathbf{b}} \left( W \times X + \mathbf{b} \right) = \mathbf{1} $$

That means our complete derivative is a matrix multiplication that looks as follows (e.g. 2 samples with 3 outputs, one sample per row): $$ \mathbf{1}^\top \frac{\partial L}{\partial Z} = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} . &. &. \\ . &. &. \end{bmatrix} $$

Note that this adds the rows together: it sums over the samples and leaves one value per output.

And that's where db2=np.sum(dz2,axis=0,keepdims=True) comes from. It is simply an abbreviation for the matrix multiplication of the local and the upstream derivatives.
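The equivalence between the sum and a product with a vector of ones is easy to check on a tiny example. In this sketch the values in `dz2` are made up, and the ones vector is written as a 1×N row that multiplies `dz2` from the left so that the shapes line up with one sample per row.

```python
import numpy as np

# Assumed upstream gradient: 2 samples (rows) with 3 outputs (columns).
dz2 = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

# Row vector of ones, one entry per sample.
ones = np.ones((1, dz2.shape[0]))

db2_matmul = ones @ dz2                       # explicit matrix product, shape (1, 3)
db2_sum = np.sum(dz2, axis=0, keepdims=True)  # the shortcut used in the code

print(db2_matmul)  # [[5. 7. 9.]]
print(np.array_equal(db2_matmul, db2_sum))  # True
```

keepdims=True keeps the result as a (1, 3) row rather than a flat (3,) array, so it has the same shape as a bias stored as a row vector.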

Licensed under: CC-BY-SA with attribution