Question

[image: diagram of the abstraction]

I'm implementing the code for this abstraction.

The forward pass is easy and looks like this: [image: forward-pass code]

I don't understand the backward pass and how it fits the abstraction in the first image:

[image: backward-pass code]

  1. Why is db defined as the product of a vector of ones (of x's shape) and dout?
  2. Why is dw defined as the product of x.T and dout?
  3. Why are both of them accumulated, i.e. why is += used and not = ?
  4. Why is dx defined as the product of dout and w.T? (See the sketch after this list for the kind of code I mean.)
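
For reference, here is a simplified sketch of the kind of layer I mean (the screenshots contain the actual code; the `Affine` class name, the shapes, and the exact expressions here are only illustrative):

```python
import numpy as np

class Affine:
    """Fully connected layer: out = x @ w + b, with x (N, D), w (D, M), b (M,)."""

    def __init__(self, w, b):
        self.w, self.b = w, b
        # gradient buffers that backward() accumulates into (hence the +=)
        self.dw = np.zeros_like(w)
        self.db = np.zeros_like(b)

    def forward(self, x):
        self.x = x                      # cache the input for the backward pass
        return x.dot(self.w) + self.b

    def backward(self, dout):
        # dout: gradient of the loss wrt this layer's output, shape (N, M)
        self.db += np.ones(self.x.shape[0]).dot(dout)  # question 1 (same as dout.sum(axis=0))
        self.dw += self.x.T.dot(dout)                   # question 2
        dx = dout.dot(self.w.T)                         # question 4: gradient wrt the input
        return dx
```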

Solution

  1. This is because the derivative of the layer output wrt $b$ is $1$: with $s = xw + b$, $\frac{\partial s}{\partial b} = 1$, so $\frac{\partial E}{\partial b} = \frac{\partial E}{\partial s}\cdot 1$, i.e. dout summed over the batch, which is exactly what multiplying a vector of ones by dout computes.
  2. dout is the derivative of the loss function wrt the prediction. Using the chain rule, $$ \frac{\partial E}{\partial w} = \frac{\partial E}{\partial y}\frac{\partial y}{\partial s}\frac{\partial s}{\partial w} $$ The last term is the vector of input features $x$. In your case dout is the combination of the first two terms. For example, for MSE loss and sigmoid activation, dout $= (y-L)y(1-y)$, where $L$ is the label. (A small numerical check of points 1 and 2 is sketched after this list.)
  3. Accumulating with += rather than overwriting with = lets gradients from several backward calls sum up before the parameter update; this pattern is often used in optimizers, for example for momentum calculation.
  4. For MLPs, you need to compute gradients for earlier layers using the gradients of deeper layers. For example, for an MLP with one hidden layer with features $\mathbf{z}$ (hence three layers in total), the vector of gradients wrt the weights $\mathbf{w}^0$ in the input layer would be $$ y= \sigma(\sum_k w^1_k \cdot\sigma(\sum_j w^0_j x_j))\\ \frac{\partial E}{\partial \mathbf{w^0}} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial s} \frac{\partial s}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial \mathbf{w}^0} = \frac{\partial E}{\partial \mathbf{z}}\frac{\partial \mathbf{z}}{\partial \mathbf{w}^{0}}\\ \frac{\partial E}{\partial \mathbf{w^0}} = (y-L) y(1-y) \sum_j\frac{\partial s}{\partial z_j}\frac{\partial z_j}{\partial \mathbf{w^0}} = (y-L) y(1-y) \sum_j\frac{\partial s}{\partial z_j}\frac{\partial z_j}{\partial s_j}\sum_i \frac{\partial s_j}{\partial w_{ij}}\\ \frac{\partial E}{\partial \mathbf{z}} = (y-L)y(1-y)\frac{\partial s}{\partial \mathbf{z}} = (y-L)y(1-y)\mathbf{w}^1 $$ So, in other words, in order to compute the gradients for the weights in the input layer, you need the gradients wrt the neurons in the hidden layer, and that is exactly what dx $=$ dout $\cdot\, w$.T provides to the layer below. (See the two-layer sketch after this list.)
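
To make points 1 and 2 (and the dx formula from point 4) concrete, here is a small numerical check I sketched (not code from the question): it compares the analytic gradients $dw = x^T\cdot$ dout, $db = \mathbf{1}^T\cdot$ dout and $dx =$ dout $\cdot\, w^T$ with finite differences, for a scalar loss whose gradient wrt the layer output is a fixed matrix `g` (so dout $= g$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 4, 5, 3
x = rng.standard_normal((N, D))
w = rng.standard_normal((D, M))
b = rng.standard_normal(M)
g = rng.standard_normal((N, M))           # plays the role of dout = dE/d(out)

def loss():
    # scalar "loss" whose gradient wrt the layer output (x @ w + b) is exactly g
    return np.sum((x.dot(w) + b) * g)

def num_grad(f, a, eps=1e-6):
    # central finite-difference gradient of the scalar function f() wrt array a
    grad = np.zeros_like(a)
    it = np.nditer(a, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = a[i]
        a[i] = old + eps; f_plus = f()
        a[i] = old - eps; f_minus = f()
        a[i] = old
        grad[i] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

dout = g
dw = x.T.dot(dout)                         # point 2
db = np.ones(N).dot(dout)                  # point 1 (same as dout.sum(axis=0))
dx = dout.dot(w.T)                         # point 4

print(np.allclose(dw, num_grad(loss, w)))  # should print True
print(np.allclose(db, num_grad(loss, b)))  # should print True
print(np.allclose(dx, num_grad(loss, x)))  # should print True
```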
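
And as a sketch of point 4 (and of where the accumulation from point 3 would appear), here is a minimal two-layer example, assuming an MSE-style loss and a sigmoid hidden layer as in the answer; the variable names are only illustrative. The dx returned by the deeper layer's backward is exactly the dout fed into the layer below it:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, H, M = 4, 5, 6, 3
x = rng.standard_normal((N, D))
w0, b0 = rng.standard_normal((D, H)), np.zeros(H)   # input (shallower) layer
w1, b1 = rng.standard_normal((H, M)), np.zeros(M)   # output (deeper) layer
target = rng.standard_normal((N, M))

# forward: affine -> sigmoid -> affine
s0 = x.dot(w0) + b0
z = 1.0 / (1.0 + np.exp(-s0))            # hidden features z
y = z.dot(w1) + b1                       # prediction
E = 0.5 * np.sum((y - target) ** 2)      # MSE-style loss

# backward: each layer turns the gradient wrt its output into gradients
# wrt its parameters AND wrt its input; the latter is the dout of the layer below
dout1 = y - target                       # dE/dy
dw1 = z.T.dot(dout1)
db1 = np.ones(N).dot(dout1)
dz = dout1.dot(w1.T)                     # dx of the deep layer = dout for the hidden layer

ds0 = dz * z * (1.0 - z)                 # through the sigmoid: dz/ds0 = z(1 - z)
dw0 = x.T.dot(ds0)                       # input-layer gradients need dz first (point 4)
db0 = np.ones(N).dot(ds0)

# with persistent gradient buffers (as in the question's code) the gradient lines
# would use +=, so gradients accumulate across backward calls until explicitly zeroed
```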