Question

I'm having some difficulty deriving backpropagation with ReLU. I did some work, but I'm not sure if I'm on the right track.

Cost function: $C = \frac{1}{2}(y-\hat y)^2$, where $y$ is the true value and $\hat y$ is the predicted value. Also assume that $x > 0$ always.


1 Layer ReLU, where the weight at the 1st layer is $w_1$

[Diagram: a single-neuron, 1-layer ReLU network with weight $w_1$ and input $x$]

$\frac{dC}{dw_1}=\frac{dC}{dR}\frac{dR}{dw_1}$

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$


2 Layer ReLU, where the weight at the 1st layer is $w_2$ and at the 2nd layer is $w_1$. I want to update the 1st-layer weight $w_2$.

[Diagram: a 2-layer ReLU network with weights $w_2$ (1st layer) and $w_1$ (2nd layer)]

$\frac{dC}{dw_2}=\frac{dC}{dR}\frac{dR}{dw_2}$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

Since $ReLU(w_1*ReLU(w_2x))=w_1w_2x$


3 Layer ReLU, where the weight at the 1st layer is $w_3$, at the 2nd layer $w_2$, and at the 3rd layer $w_1$

[Diagram: a 3-layer ReLU network with weights $w_3$, $w_2$, $w_1$ from input to output]

$\frac{dC}{dw_3}=\frac{dC}{dR}\frac{dR}{dw_3}$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

Since $ReLU(w_1*ReLU(w_2*ReLU(w_3x)))=w_1w_2w_3x$

So the chain rule only ever involves 2 derivatives, compared to a sigmoid, where it could be as long as $n$ layers.


Say I wanted to update all 3 layers' weights, where $w_1$ is at the 3rd layer, $w_2$ at the 2nd layer, and $w_3$ at the 1st layer:

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

If this derivation is correct, how does this prevent vanishing gradients? Compared to a sigmoid, where we have a lot of multiplications by 0.25 in the equation, ReLU does not have any constant-value multiplication. But if there are thousands of layers, there would be a lot of multiplication by the weights - wouldn't this cause a vanishing or exploding gradient?


Solution

Working definitions of ReLU function and its derivative:

$ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ x, & \text{otherwise}. \end{cases}$

$\frac{d}{dx} ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{otherwise}. \end{cases}$

The derivative is the unit step function. This does ignore a problem at $x=0$, where the gradient is not strictly defined, but that is not a practical concern for neural networks. With the above formula, the derivative at 0 is 1, but you could equally treat it as 0, or 0.5 with no real impact to neural network performance.
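
For reference, here is a minimal NumPy sketch of these two working definitions (my own illustration, not from the original answer, treating the derivative at exactly 0 as 1, as above):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Unit step function: 0 for x < 0, 1 otherwise (value at exactly 0 taken as 1)
    return np.where(x < 0, 0.0, 1.0)
```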


Simplified network

With those definitions, let's take a look at your example networks.

You are running regression with cost function $C = \frac{1}{2}(y-\hat{y})^2$. You have defined $R$ as the output of the artificial neuron, but you have not defined an input value. I'll add that for completeness - call it $z$ - and add some indexing by layer. I prefer lower-case for vectors and upper-case for matrices, so $r^{(1)}$ is the output of the first layer, $z^{(1)}$ its input, and $W^{(0)}$ the weight connecting the neuron to its input $x$ (in a larger network, that might connect to a deeper $r$ value instead). I have also adjusted the index number for the weight matrix - why will become clearer for the larger network. NB I am ignoring having more than one neuron in each layer for now.

Looking at your simple 1 layer, 1 neuron network, the feed-forward equations are:

$z^{(1)} = W^{(0)}x$

$\hat{y} = r^{(1)} = ReLU(z^{(1)})$

The derivative of the cost function w.r.t. an example estimate is:

$\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}} = \frac{\partial}{\partial r^{(1)}}\frac{1}{2}(y-r^{(1)})^2 = \frac{1}{2}\frac{\partial}{\partial r^{(1)}}(y^2 - 2yr^{(1)} + (r^{(1)})^2) = r^{(1)} - y$

Using the chain rule for back propagation to the pre-transform ($z$) value:

$\frac{\partial C}{\partial z^{(1)}} = \frac{\partial C}{\partial r^{(1)}} \frac{\partial r^{(1)}}{\partial z^{(1)}} = (r^{(1)} - y)Step(z^{(1)}) = (ReLU(z^{(1)}) - y)Step(z^{(1)})$

This $\frac{\partial C}{\partial z^{(1)}}$ is an interim stage and critical part of backprop linking steps together. Derivations often skip this part because clever combinations of cost function and output layer mean that it is simplified. Here it is not.

To get the gradient with respect to the weight $W^{(0)}$, then it is another iteration of the chain rule:

$\frac{\partial C}{\partial W^{(0)}} = \frac{\partial C}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(0)}} = (ReLU(z^{(1)}) - y)Step(z^{(1)})x = (ReLU(W^{(0)}x) - y)Step(W^{(0)}x)x$

...because $z^{(1)} = W^{(0)}x$, therefore $\frac{\partial z^{(1)}}{\partial W^{(0)}} = x$

That is the full solution for your simplest network.
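
As a quick sanity check (my own addition, using arbitrary illustrative numbers and the `relu`/`relu_grad` helpers sketched earlier), the analytic gradient can be compared against a finite-difference estimate:

```python
# Finite-difference check of dC/dW0 = (ReLU(W0*x) - y) * Step(W0*x) * x
# for the single-neuron network above; all values are arbitrary scalars.
W0, x, y = 0.7, 2.0, 1.5

def cost(w):
    y_hat = relu(w * x)                 # forward pass: z = w*x, y_hat = ReLU(z)
    return 0.5 * (y - y_hat) ** 2

analytic = (relu(W0 * x) - y) * relu_grad(W0 * x) * x

eps = 1e-6
numeric = (cost(W0 + eps) - cost(W0 - eps)) / (2 * eps)

print(analytic, numeric)                # both approximately -0.2 here
```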

However, in a layered network, you also need to carry the same logic down to the next layer. Also, you typically have more than one neuron in a layer.


More general ReLU network

If we add in more generic terms, then we can work with two arbitrary layers. Call them Layer $(k)$ indexed by $i$, and Layer $(k+1)$ indexed by $j$. The weights are now a matrix. So our feed-forward equations look like this:

$z^{(k+1)}_j = \sum_{\forall i} W^{(k)}_{ij}r^{(k)}_i$

$r^{(k+1)}_j = ReLU(z^{(k+1)}_j)$

In the output layer, the initial gradient w.r.t. $r^{output}_j$ is still $r^{output}_j - y_j$. However, ignore that for now, and look at the generic way to back propagate, assuming we have already found $\frac{\partial C}{\partial r^{(k+1)}_j}$ - just note that this is ultimately where the output cost function gradients come from. Then there are 3 equations we can write out following the chain rule:

First we need to get to the neuron input before applying ReLU:

  1. $\frac{\partial C}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j} \frac{\partial r^{(k+1)}_j}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j}Step(z^{(k+1)}_j)$

We also need to propagate the gradient to previous layers, which involves summing up all connected influences to each neuron:

  2. $\frac{\partial C}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} W^{(k)}_{ij}$

And we need to connect this to the weights matrix in order to make adjustments later:

  3. $\frac{\partial C}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} r^{(k)}_{i}$

You can resolve these further (by substituting in previous values), or combine them (often steps 1 and 2 are combined to relate pre-transform gradients layer by layer). However, the above is the most general form. You can also substitute the $Step(z^{(k+1)}_j)$ in equation 1 with the derivative of whatever activation function you are using - this is the only place where the activation affects the calculations.
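
As an illustration only, here is one way the three equations might look as a vectorised NumPy function for a single layer, using the same convention that $W^{(k)}_{ij}$ connects $r^{(k)}_i$ to $z^{(k+1)}_j$ (the function and variable names are my own, and it reuses the `relu_grad` helper sketched earlier):

```python
def layer_backward(dC_dr_next, z_next, r_prev, W):
    """One generic backprop step for a ReLU layer.

    dC_dr_next : gradient w.r.t. this layer's outputs r^(k+1), shape (n_out,)
    z_next     : pre-activation values z^(k+1),                shape (n_out,)
    r_prev     : previous layer's outputs r^(k),               shape (n_in,)
    W          : weights W^(k), where W[i, j] connects r^(k)_i to z^(k+1)_j
    """
    # Eq. 1: back through the ReLU, dC/dz_j = dC/dr_j * Step(z_j)
    dC_dz_next = dC_dr_next * relu_grad(z_next)
    # Eq. 2: to the previous layer's outputs, dC/dr_i = sum_j dC/dz_j * W_ij
    dC_dr_prev = W @ dC_dz_next
    # Eq. 3: to the weights, dC/dW_ij = dC/dz_j * r_i (an outer product)
    dC_dW = np.outer(r_prev, dC_dz_next)
    return dC_dz_next, dC_dr_prev, dC_dW
```

A full backward pass would apply this function layer by layer, starting from $\frac{\partial C}{\partial r}$ at the output layer, which for this cost function is $r - y$.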


Back to your questions:

If this derivation is correct, how does this prevent vanishing?

Your derivation was not correct. However, that does not completely address your concerns.

The difference between using sigmoid and ReLU is just the step function compared to e.g. the sigmoid's $y(1-y)$, applied once per layer. As you can see from the generic layer-by-layer equations above, the gradient of the transfer function appears in one place only. The sigmoid's best-case derivative contributes a factor of 0.25 (when $x = 0$, $y = 0.5$), and it gets worse than that, saturating quickly to a near-zero derivative away from $x=0$. ReLU's gradient is either 0 or 1, and in a healthy network it will be 1 often enough to lose less gradient during backpropagation. This is not guaranteed, but experiments show that ReLU performs well in deep networks.
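
As a toy illustration of that once-per-layer factor (my own numbers, not describing any real network), compare the best-case sigmoid derivative with an always-active ReLU path over many layers:

```python
# Multiply together the per-layer activation-derivative factors for 50 layers:
# the *best case* for sigmoid (0.25, at z = 0) versus an always-active ReLU path.
n_layers = 50
print(0.25 ** n_layers)   # ~7.9e-31: the activation alone shrinks the gradient
print(1.0 ** n_layers)    # 1.0: ReLU's active path adds no attenuation
```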

If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

Yes, this can have an impact too, and it can be a problem regardless of the choice of transfer function. In some combinations, ReLU may also help keep exploding gradients under control, because it does not saturate (so large weight norms will tend to be poor direct solutions, and an optimiser is unlikely to move towards them). However, this is not guaranteed.

Licensed under: CC-BY-SA with attribution