Question

Image shows a typical layer somewhere in a feed forward network:

[Figure: a single layer of a feed-forward network]

$a_i^{(k)}$ is the activation value of the $i^{th}$ neuron in the $k^{th}$ layer.

$W_{ij}^{(k)}$ is the weight connecting the $i^{th}$ neuron in the $k^{th}$ layer to the $j^{th}$ neuron in the $(k+1)^{th}$ layer.

$z_j^{(k+1)}$ is the pre-activation value for the $j^{th}$ neuron in the $(k+1)^{th}$ layer. It is sometimes called the "logit" when a logistic activation function is used.

The feed forward equations are as follows:

$z_j^{(k+1)} = \sum_i W_{ij}^{(k)}a_i^{(k)}$

$a_j^{(k+1)} = f(z_j^{(k+1)})$

For simplicity, the bias is included as a dummy activation fixed at 1, and its use is implied in the sums over $i$.

I can derive the equations for back propagation on a feed-forward neural network, using the chain rule and identifying individual scalar values in the network (in fact, I often do this as a paper exercise just for practice):

Given $\nabla a_j^{(k+1)} = \frac{\partial E}{\partial a_j^{(k+1)}}$ as the gradient of the error function with respect to a neuron's output:

1. $\nabla z_j^{(k+1)} = \frac{\partial E}{\partial z_j^{(k+1)}} = \frac{\partial E}{\partial a_j^{(k+1)}} \frac{\partial a_j^{(k+1)}}{\partial z_j^{(k+1)}} = \nabla a_j^{(k+1)} f'(z_j^{(k+1)})$

2. $\nabla a_i^{(k)} = \frac{\partial E}{\partial a_i^{(k)}} = \sum_j \frac{\partial E}{\partial z_j^{(k+1)}} \frac{\partial z_j^{(k+1)}}{\partial a_i^{(k)}} = \sum_j \nabla z_j^{(k+1)} W_{ij}^{(k)}$

3. $\nabla W_{ij}^{(k)} = \frac{\partial E}{\partial W_{ij}^{(k)}} = \frac{\partial E}{\partial z_j^{(k+1)}} \frac{\partial z_j^{(k+1)}}{\partial W_{ij}^{(k)}} = \nabla z_j^{(k+1)} a_{i}^{(k)}$
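
To convince myself the scalar equations are right, I sometimes also check them numerically. Here is a minimal NumPy sketch of that check for a tiny layer (the sizes, the logistic activation and the squared-error loss are just arbitrary choices for illustration):

```python
import numpy as np

# Tiny layer: 3 inputs, 2 outputs, logistic activation, squared-error loss E.
rng = np.random.default_rng(0)
a_k = rng.normal(size=3)                 # a_i^(k)
W = rng.normal(size=(3, 2))              # W[i, j] = W_ij^(k)
target = rng.normal(size=2)

f = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic activation
f_prime = lambda z: f(z) * (1.0 - f(z))

def error(W):
    z = a_k @ W                          # z_j^(k+1) = sum_i W_ij a_i
    a_next = f(z)
    return 0.5 * np.sum((a_next - target) ** 2)

# Scalar back-propagation, exactly as in equations 1-3.
z = a_k @ W
a_next = f(z)
grad_a_next = a_next - target                                  # dE/da_j^(k+1) for squared error
grad_z = grad_a_next * f_prime(z)                              # equation 1
grad_a = np.array([sum(grad_z[j] * W[i, j] for j in range(2))  # equation 2
                   for i in range(3)])
grad_W = np.array([[grad_z[j] * a_k[i] for j in range(2)]      # equation 3
                   for i in range(3)])

# Finite-difference check of equation 3.
eps = 1e-6
num_grad_W = np.zeros_like(W)
for i in range(3):
    for j in range(2):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_grad_W[i, j] = (error(Wp) - error(Wm)) / (2 * eps)

print(np.allclose(grad_W, num_grad_W))   # True
```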

So far, so good. However, it is often better to recall these equations using matrices and vectors to represent the elements. I can do that, but I am not able to figure out the "native" representation of the equivalent logic in the middle of the derivations. I can figure out what the end forms should be by referring back to the scalar version and checking that the multiplications have the correct dimensions, but I have no idea why I should put the equations in those forms.

Is there actually a way of expressing the tensor-based derivation of back propagation, using only vector and matrix operations, or is it a matter of "fitting" it to the above derivation?

Using column vectors $\mathbf{a}^{(k)}$, $\mathbf{z}^{(k+1)}$, $\mathbf{a}^{(k+1)}$, a weight matrix $\mathbf{W}^{(k)}$, and a bias vector $\mathbf{b}^{(k)}$, the feed-forward operations are:

$\mathbf{z}^{(k+1)} = \mathbf{W}^{(k)}\mathbf{a}^{(k)} + \mathbf{b}^{(k)}$

$\mathbf{a}^{(k+1)} = f(\mathbf{z}^{(k+1)})$

Then my attempt at derivation looks like this:

1. $\nabla \mathbf{z}^{(k+1)} = \frac{\partial E}{\partial \mathbf{z}^{(k+1)}} = ??? = \nabla \mathbf{a}^{(k+1)} \odot f'(\mathbf{z}^{(k+1)})$

2. $\nabla \mathbf{a}^{(k)} = \frac{\partial E}{\partial \mathbf{a}^{(k)}} = ??? = {\mathbf{W}^{(k)}}^{T} \nabla \mathbf{z}^{(k+1)}$

3. $\nabla \mathbf{W}^{(k)} = \frac{\partial E}{\partial \mathbf{W}^{(k)}} = ??? = \nabla\mathbf{z}^{(k+1)} {\mathbf{a}^{(k)}}^T $

Here $\odot$ represents element-wise multiplication. I have not bothered showing the equation for the bias.
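
To make the shape constraints concrete, here is how I would code these target end forms in NumPy (a sketch only: the sizes, the logistic activation and the made-up upstream gradient are arbitrary). The shape comments at least show which products are dimensionally possible:

```python
import numpy as np

n_in, n_out = 3, 2
rng = np.random.default_rng(1)
a_k = rng.normal(size=(n_in, 1))             # a^(k),   shape (n_in, 1)
W = rng.normal(size=(n_out, n_in))           # W^(k),   shape (n_out, n_in)
b = rng.normal(size=(n_out, 1))              # b^(k),   shape (n_out, 1)
f = lambda z: 1.0 / (1.0 + np.exp(-z))       # logistic activation
f_prime = lambda z: f(z) * (1.0 - f(z))

# Feed-forward
z_next = W @ a_k + b                         # z^(k+1), shape (n_out, 1)
a_next = f(z_next)                           # a^(k+1), shape (n_out, 1)

# Upstream gradient dE/da^(k+1), assumed given (made up here).
grad_a_next = rng.normal(size=(n_out, 1))    # shape (n_out, 1)

# The three target equations:
grad_z_next = grad_a_next * f_prime(z_next)  # 1. element-wise, shape (n_out, 1)
grad_a_k = W.T @ grad_z_next                 # 2. (n_in, n_out) @ (n_out, 1) -> (n_in, 1)
grad_W = grad_z_next @ a_k.T                 # 3. (n_out, 1) @ (1, n_in) -> (n_out, n_in), same shape as W
grad_b = grad_z_next                         # bias gradient (the equation I did not write out)
```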

Where I have put ???, I am not sure of the correct way to go from the feed-forward operations, plus knowledge of how to differentiate the linear operations involved, to the correct form of these equations. I could just write out some partial derivative terms, but I have no clue as to why some should use element-wise multiplication and others matrix multiplication, or why the multiplication order has to be as shown, other than that this clearly gives the correct result in the end.

I am not even sure if there is a purely tensor derivation, or whether it is all just a "vectorisation" of the first set of equations. My algebra is not that strong, and I am interested to find out for certain either way. I feel it would help me comprehend work in e.g. TensorFlow if I had a better native understanding of these operations, by thinking more in terms of tensor algebra.


Sorry about the ad-hoc/incorrect notation. I understand now that $\nabla a_j^{(k+1)}$ is more properly written $\nabla_{a_j^{(k+1)}}E$, thanks to Ehsan's answer. What I really wanted there was a short reference variable to substitute into the equations, as opposed to the verbose partial derivatives.


Solution

Notation matters! The problem starts from:

Given $\nabla a_j^{(k+1)} = \frac{\partial E}{\partial a_j^{(k+1)}}$

I don't like your notation! In fact, it is wrong in standard mathematical notation. The correct notation is

$$\nabla_{a_j^{(k+1)}} E = \frac{\partial E}{\partial a_j^{(k+1)}}$$

Then the gradient of the error $E$ w.r.t. the vector $\mathbf{a}^{(k)}$ is defined as

$$\nabla_{\mathbf{a}^{(k)}} E = \left( \frac{\partial E}{\partial a_1^{(k)}} , \cdots, \frac{\partial E}{\partial a_n^{(k)}}\right)^T \;\;\;\; (\star)$$

(Side note: we transpose because of the convention that vectors are represented as column vectors; if you prefer to represent them as row vectors, then the equations you want to prove change by a transpose!)

Therefore, by the chain rule,

$$\frac{\partial E}{\partial a_i^{(k)}}= \sum_j \frac{\partial E}{\partial z_j^{(k+1)}} \frac{\partial z_j^{(k+1)}}{\partial a_i^{(k)}}=\sum_j \frac{\partial E}{\partial z_j^{(k+1)}}W_{ij}^{(k)}$$

because $z_j^{(k+1)} = \sum_i W_{ij}^{(k)}a_i^{(k)}.$ Now you can express the above as a vector (inner) product

$$\frac{\partial E}{\partial a_i^{(k)}} = (W_{:,i}^{(k)})^T \nabla_{\mathbf{z}^{(k+1)}} E,$$ where $W_{:,i}^{(k)}$ denotes the $i^{th}$ column of the matrix $\mathbf{W}^{(k)}$ from the feed-forward equation $\mathbf{z}^{(k+1)} = \mathbf{W}^{(k)}\mathbf{a}^{(k)} + \mathbf{b}^{(k)}$ (that column collects the weights $W_{ij}^{(k)}$ leaving neuron $i$). Stacking these entries as in $(\star)$, we can express $\nabla_{\mathbf{a}^{(k)}} E$ as a matrix-vector product

$$\nabla_{\mathbf{a}^{(k)}} E = (\mathbf{W}^{(k)})^T\nabla_{\mathbf{z}^{(k+1)}} E.$$

I'll leave the rest to you :)
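
If you want a quick numerical sanity check that the stacking really is just a matrix-vector product, here is a small NumPy sketch (the sizes are arbitrary, and `W` is the matrix from your vectorized feed-forward equation, so its $i^{th}$ column holds the weights $W_{ij}^{(k)}$ leaving input neuron $i$):

```python
import numpy as np

n_in, n_out = 4, 3
rng = np.random.default_rng(2)
W = rng.normal(size=(n_out, n_in))   # the W^(k) of z^(k+1) = W^(k) a^(k) + b^(k)
grad_z = rng.normal(size=n_out)      # nabla_{z^(k+1)} E, assumed given

# Entry by entry: dE/da_i^(k) = (W_{:,i})^T nabla_z E   (the inner-product form)
grad_a_entries = np.array([W[:, i] @ grad_z for i in range(n_in)])

# Stacked as in (star): nabla_a E = W^T nabla_z E        (the matrix-vector form)
grad_a_stacked = W.T @ grad_z

print(np.allclose(grad_a_entries, grad_a_stacked))   # True
```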

More vector calculusy!

Let's use the convention of vectors as column vectors, and let $\mathbf{W}^{(k)}$ here denote the matrix whose $(i,j)$ entry is $W_{ij}^{(k)}$ (i.e. the transpose of the matrix in the feed-forward equation above). Then $\mathbf{z}^{(k+1)} = (\mathbf{W}^{(k)})^T \mathbf{a}^{(k)} + \mathbf{b}^{(k)}$ and

$$\nabla_{\mathbf{a}^{(k)}} E = \frac{\partial E}{\partial \mathbf{a}^{(k)}} = \frac{\partial \mathbf{z^{(k+1)}}}{\partial \mathbf{a}^{(k)}} \frac{\partial E}{\partial \mathbf{z}^{(k+1)}}= \mathbf{W}^{(k)} \frac{\partial E}{\partial \mathbf{z}^{(k+1)}}$$

because

$$\frac{\partial \mathbf{z^{(k+1)}}}{\partial \mathbf{a}^{(k)}} = \dfrac{\partial\left((\mathbf{W}^{(k)})^T \mathbf{a}^{(k)} + \mathbf{b}^{(k)}\right)}{\partial \mathbf{a}^{(k)}}=\dfrac{\partial\left((\mathbf{W}^{(k)})^T \mathbf{a}^{(k)}\right)}{\partial \mathbf{a}^{(k)}} + \dfrac{\partial\mathbf{b}^{(k)}}{\partial \mathbf{a}^{(k)}}$$

and $\dfrac{\partial\mathbf{b}^{(k)}}{\partial \mathbf{a}^{(k)}}=0$ since $\mathbf{b}^{(k)}$ doesn't depend on $\mathbf{a}^{(k)}.$

Thus

$$\dfrac{\partial\left((\mathbf{W}^{(k)})^T \mathbf{a}^{(k)}\right)}{\partial \mathbf{a}^{(k)}} = \dfrac{\partial \mathbf{a}^{(k)}}{\partial \mathbf{a}^{(k)}} \mathbf{W}^{(k)} = \mathbf{W}^{(k)}.$$

by the vector-by-vector differentiation identities (the eighth and seventh rows of the last column in the usual table of matrix-calculus identities, respectively).
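
The same style of argument also recovers the element-wise product in your first target equation: the activation $f$ is applied element-wise, so its Jacobian is diagonal, and (as a sketch)

$$\frac{\partial \mathbf{a}^{(k+1)}}{\partial \mathbf{z}^{(k+1)}} = \operatorname{diag}\!\left(f'(\mathbf{z}^{(k+1)})\right), \qquad \nabla_{\mathbf{z}^{(k+1)}} E = \frac{\partial \mathbf{a}^{(k+1)}}{\partial \mathbf{z}^{(k+1)}} \frac{\partial E}{\partial \mathbf{a}^{(k+1)}} = \operatorname{diag}\!\left(f'(\mathbf{z}^{(k+1)})\right) \nabla_{\mathbf{a}^{(k+1)}} E = f'(\mathbf{z}^{(k+1)}) \odot \nabla_{\mathbf{a}^{(k+1)}} E.$$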
