Question

I am reading about backpropagation, one of the processes involved in gradient descent, and I couldn't quite grasp the intuition behind one of the equations.

http://neuralnetworksanddeeplearning.com/chap2.html#MathJax-Element-185-Frame

An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$: In particular

$$ \delta^l=((w^{l+1})^T\delta^{l+1})\circ\sigma'(z^l), $$

where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for the $(l+1)^{th}$ layer.

...

When we apply the transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of this as moving the error backward through the network, giving us some sort of measure of the error at the output of the $l^{th}$ layer.

I know how each term is derived algebraically (from further down in the text), but I can't grasp the intuitive part. In particular, I don't understand how applying the transpose of the weight matrix is like moving the error backwards.

The mathematics is quite straightforward: the equation expresses the error in layer $l$ in terms of the error in layer $l+1$. But the intuition doesn't come so naturally.

Could someone please explain, in what sense applying the transpose is like "moving the error backwards"?


Solution

in what sense applying the transpose is like "moving the error backwards"?

It isn't. Or at least you shouldn't think too hard about the analogy.

What you are actually doing is calculating the gradient (the partial derivatives) of the error/cost term with respect to the activations and weights in the network. This is done by repeatedly applying the chain rule to terms that depend on each other, until you have the gradient for every variable you control. Usually you are interested in changing the weights between neurons, but you can alter anything else that is under your control; this is exactly how deep dreaming works, for example: instead of treating the weights as the variable, you treat the input image as one.
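To make that concrete, here is a minimal sketch of a two-layer network in NumPy (my own toy example, not the book's code); the layer sizes, sigmoid activations and quadratic cost are assumptions chosen for brevity. The backward pass is nothing but the chain rule applied layer by layer, and the last line shows the deep-dreaming twist of taking the gradient with respect to the input instead of the weights:

```python
import numpy as np

# Toy 2-layer network; the sizes, sigmoid activations and quadratic cost
# are illustrative assumptions, not anything prescribed by the book.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))        # input (column vector)
y = rng.standard_normal((2, 1))        # target

W1 = rng.standard_normal((4, 3)); b1 = rng.standard_normal((4, 1))  # layer 1: 3 -> 4
W2 = rng.standard_normal((2, 4)); b2 = rng.standard_normal((2, 1))  # layer 2: 4 -> 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Forward pass
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
cost = 0.5 * np.sum((a2 - y) ** 2)

# Backward pass: nothing but the chain rule, applied layer by layer
delta2 = (a2 - y) * sigmoid_prime(z2)          # dC/dz2
delta1 = (W2.T @ delta2) * sigmoid_prime(z1)   # dC/dz1 -- the equation in question

grad_W2 = delta2 @ a1.T                        # dC/dW2
grad_W1 = delta1 @ x.T                         # dC/dW1
grad_x  = W1.T @ delta1                        # dC/dx: gradient w.r.t. the *input*,
                                               # the quantity deep dreaming optimises
```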

The analogy is useful in that it explains the progression and goal of the repeated calculations across layers. It is also correct in that values of the gradient in one layer depend critically on values in "higher" layers. This can lead to intuitions about managing problems with gradient values in deep networks.

There is no deeper relation between taking the matrix transpose and "moving the error backwards". What you are actually doing is summing over terms connected by the weights in the NN model, and that happens to be the same operation as the matrix multiplication given. The matrix multiplication is a concise way of writing the relationship (and is useful when looking at optimisations), but it is not fundamental to understanding neural networks. A quick numerical check of that equivalence is shown below.
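As a quick sanity check (sizes picked arbitrarily), the explicit per-neuron sum over the connecting weights gives exactly the same numbers as multiplying by the transpose:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 3))      # W[j, k]: weight from neuron k (layer l) to neuron j (layer l+1)
delta_next = rng.standard_normal(5)  # error of the 5 neurons in layer l+1

# Explicit sum over the neurons in layer l+1 that neuron k connects to ...
summed = np.array([sum(W[j, k] * delta_next[j] for j in range(5)) for k in range(3)])

# ... is exactly the (w^{l+1})^T delta^{l+1} part of the book's equation.
print(np.allclose(summed, W.T @ delta_next))   # True
```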

The statement "moving the error backwards" is in my opinion a common cause of problems when developers who have not studied the theory carefully try to implement it. I have seen quite a few attempts at implementing backpropagation that cargo-cult error terms and multiplications in without understanding them. It does not help here that some combinations of cost functions and activation functions in the output layer have been deliberately chosen to have very simple derivatives, so that it looks like the neural network is copying error values around with almost magical multiplications by activation values and so on. Actually deriving those simple terms takes a reasonably competent knowledge of basic calculus.

OTHER TIPS

I like to picture pre-multiplication with the weight matrix as a shorthand for multiplying the weights of the connections into a particular neuron (say the $j^{th}$) in the $l^{th}$ layer by the corresponding values from each neuron in the $(l-1)^{th}$ layer, and summing.

Similarly, pre-multiplication with the transpose of the weight matrix is a shorthand for multiplying the weights of the connections out of a particular neuron (say the $k^{th}$) in the $(l-1)^{th}$ layer by the corresponding values from each neuron in the $l^{th}$ layer, and summing.

This intuition stems from the definition of the weight matrix: $w^l_{jk}$ denotes the weight for the connection from the $k^{th}$ neuron in the $(l-1)^{th}$ layer to the $j^{th}$ neuron in the $l^{th}$ layer (see the notation diagram at neuralnetworksanddeeplearning.com).

And since you can interpret pre-multiplication by a matrix as keeping the first subscript fixed and summing over the second, the intuition becomes natural: pre-multiplying by the weight matrix sums over the previous layer's weights, while pre-multiplying by its transpose sums over the next layer's weights, as the small worked example below shows.
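For instance, with a hypothetical layer of 2 neurons fed by 3 neurons (so $w^l$ is $2 \times 3$):

$$ w^l = \begin{pmatrix} w^l_{11} & w^l_{12} & w^l_{13} \\ w^l_{21} & w^l_{22} & w^l_{23} \end{pmatrix}, \qquad (w^l a^{l-1})_j = \sum_k w^l_{jk}\, a^{l-1}_k \quad \text{(fix } j\text{, sum over the previous layer)}, $$

$$ (w^l)^T = \begin{pmatrix} w^l_{11} & w^l_{21} \\ w^l_{12} & w^l_{22} \\ w^l_{13} & w^l_{23} \end{pmatrix}, \qquad \big((w^l)^T \delta^l\big)_k = \sum_j w^l_{jk}\, \delta^l_j \quad \text{(fix } k\text{, sum over the next layer)}. $$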

(figure: Matrix Multiplication)

I would like to end by quoting John von Neumann:

"Young man, in mathematics you don't understand things. You just get used to them."

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange