Question

I have some problems with backpropagation in a softmax output layer. I know how it should work, but if I apply the chain rule in the classical way, I get a different result than when Softmax is differentiated together with the Cross-Entropy error. Here's an example from a network with a single data point and 2 output neurons using Softmax as the activation function. $$ Z^{O} := (1.75, 1.75) $$ $$ A^{O} := (0.5, 0.5) $$ Here $ Z^{O} $ is the layer output before applying the Softmax activation function, while $ A^{O} $ is the activation - the guess - of the network.

Differentiating the CE error function directly with respect to $ Z^{O} $, with the SoftMax folded in, gives: $$ \frac{\partial{E(A^{O},T)}}{\partial{Z^{O}}} = \frac{\partial{\sum_{i=0}^{n}{-T_{i}*ln(A_{i}^{O})}}}{\partial{Z^{O}}} = \frac{\partial{\sum_{i=0}^{n}{-T_{i}*ln(Softmax(Z^{O})_{i})}}}{\partial{Z^{O}}} = A^{O} - T $$ With: $$ A^{O} = \begin{pmatrix}0.5&0.5\end{pmatrix} $$ $$ T = \begin{pmatrix}0&1\end{pmatrix} $$ This yields: $$ A^{O} - T = \begin{pmatrix}0.5&-0.5\end{pmatrix} $$
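(As a quick numerical sanity check, here is a tiny numpy sketch of these numbers; the variable names z, a and t are just mine for illustration.)

import numpy as np

z = np.array([1.75, 1.75])           # Z^O, pre-activation outputs
t = np.array([0.0, 1.0])             # T, one-hot target
a = np.exp(z) / np.sum(np.exp(z))    # A^O = Softmax(Z^O) -> [0.5 0.5]
print(a - t)                         # A^O - T -> [ 0.5 -0.5]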

This is fine. Now I need to get the same result when applying the chain rule and differentiating the error function and the activation function separately: $$ \frac{\partial{E(A^{O},T)}}{\partial{A^{O}}} * \frac{\partial{A^{O}}}{\partial{Z^{O}}} $$

The first part is trivial: $$ \frac{\partial{E(A^{O},T)}}{\partial{A^{O}}} = \frac{\partial{\sum_{i=0}^{n}{-T_{i}*ln(A_{i}^{O})}}}{\partial{A^{O}}} = -\frac{T_{i}}{A_{i}^{O}} = \begin{pmatrix}-\frac{0}{0.5}&-\frac{1}{0.5}\end{pmatrix} = \begin{pmatrix}0&-2\end{pmatrix} $$
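(The same factor as an element-wise division in numpy, again just a sketch:)

import numpy as np

a = np.array([0.5, 0.5])             # A^O
t = np.array([0.0, 1.0])             # T
print(-t / a)                        # -T_i / A_i^O -> [-0. -2.], i.e. (0, -2)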

The second part is of course a matrix of $ Softmax^{'}(Z^{O}) $: $$ \frac{\partial{Softmax(Z^{O})}}{\partial{Z^{O}}} = \begin{pmatrix}Z^{O}_{0}(1-Z^{O}_{0})&-Z^{O}_{0}Z^{O}_{1}&...&-Z^{O}_{0}Z^{O}_{n}\\-Z^{O}_{1}Z^{O}_{0}&Z^{O}_{1}(1-Z^{O}_{1})&...&-Z^{O}_{1}Z^{O}_{n}\\...&...&&...\\-Z^{O}_{n}Z^{O}_{0}&-Z^{O}_{n}Z^{O}_{1}&...&Z^{O}_{n}(1-Z^{O}_{n})\end{pmatrix} $$

import numpy as np

def df2(d):
    # Builds, for each row f of d, the matrix with f_i * (1 - f_i) on the
    # diagonal and -f_i * f_j off the diagonal.
    def softmax_prime(f):
        jacobian = np.diag(f)
        for i in range(len(jacobian)):
            for j in range(len(jacobian)):
                if i == j:
                    jacobian[i][j] = f[i] * (1 - f[i])
                else:
                    jacobian[i][j] = -f[i] * f[j]
        return jacobian
    return np.apply_along_axis(softmax_prime, axis=1, arr=d)
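For reference, this is how I call it, passing in the pre-activations (my variable z1 holds the row $ \begin{pmatrix}1.75&1.75\end{pmatrix} $):

z1 = np.array([[1.75, 1.75]])   # Z^O as a single-row batch
jacobian = df2(z1)[0]           # 2x2 matrix for that row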

For the above $ \begin{pmatrix}1.75&1.75\end{pmatrix} $ this yields: $$ \frac{\partial{A^{O}}}{\partial{Z^O}} = SoftMax'(\begin{pmatrix}1.75&1.75\end{pmatrix}) = \begin{pmatrix}-1.3125&-3.0625\\-3.0625&-1.3125\end{pmatrix} $$

Now, according to the chain rule, if I multiply these two together I would get $ \delta^{O} $ for the output layer, and it should be the same as the above $ A^{O} - T $. However, this gives: $$ \frac{\partial{E(A^{O},T)}}{\partial{A^{O}}} * \frac{\partial{A^{O}}}{\partial{Z^{O}}} = \begin{pmatrix}0&-2\end{pmatrix} * \begin{pmatrix}-1.3125&-3.0625\\-3.0625&-1.3125\end{pmatrix} = \begin{pmatrix}6.125&2.625\end{pmatrix} $$
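(Checking that product numerically with the df2 output from above:)

dE_dA = np.array([0.0, -2.0])                      # dE/dA^O from the first part
jacobian_wrong = df2(np.array([[1.75, 1.75]]))[0]  # the matrix shown above
print(dE_dA @ jacobian_wrong)                      # [6.125 2.625]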

This is different from the original calculation above. I'm sure I made a mistake somewhere, but I can't see where. The deltas must match, that much I'm sure of, and sure enough, if I use the chain rule like this the error increases with each epoch.

Can anyone tell me where I went wrong?


Solution

I'm just going to answer my own question here so that I can find it more easily in the future, and hopefully it comes in handy for others who ask about the derivative of Softmax and get confused by the fact that it's a matrix.

The matrix form shown for $\frac{\partial{Softmax(x)}}{\partial{x}} $ in the opening question is incorrect. The correct formula is: $$ \frac{\partial{Softmax(Z^{O})}}{\partial{Z^{O}}} := \begin{pmatrix}\frac{\partial{Softmax(Z^{O}_{0})}}{\partial{Z^{O}_{0}}}&\frac{\partial{Softmax(Z^{O}_{0})}}{\partial{Z^{O}_{1}}}&\frac{\partial{Softmax(Z^{O}_{0})}}{\partial{Z^{O}_{2}}}&...&\frac{\partial{Softmax(Z^{O}_{0})}}{\partial{Z^{O}_{N}}}\\\frac{\partial{Softmax(Z^{O}_{1})}}{\partial{Z^{O}_{0}}}&\frac{\partial{Softmax(Z^{O}_{1})}}{\partial{Z^{O}_{1}}}&\frac{\partial{Softmax(Z^{O}_{1})}}{\partial{Z^{O}_{2}}}&...&\frac{\partial{Softmax(Z^{O}_{1})}}{\partial{Z^{O}_{N}}}\\\frac{\partial{Softmax(Z^{O}_{2})}}{\partial{Z^{O}_{0}}}&\frac{\partial{Softmax(Z^{O}_{2})}}{\partial{Z^{O}_{1}}}&\frac{\partial{Softmax(Z^{O}_{2})}}{\partial{Z^{O}_{2}}}&...&\frac{\partial{Softmax(Z^{O}_{2})}}{\partial{Z^{O}_{N}}}\\...&...&...&&...\\\frac{\partial{Softmax(Z^{O}_{N})}}{\partial{Z^{O}_{0}}}&\frac{\partial{Softmax(Z^{O}_{N})}}{\partial{Z^{O}_{1}}}&\frac{\partial{Softmax(Z^{O}_{N})}}{\partial{Z^{O}_{2}}}&...&\frac{\partial{Softmax(Z^{O}_{N})}}{\partial{Z^{O}_{N}}}\end{pmatrix}$$ Which is (writing $ S $ for $ Softmax $): $$ \begin{pmatrix}S(Z^{O}_{0})(1-S(Z^{O}_{0}))&-S(Z^{O}_{0})S(Z^{O}_{1})&-S(Z^{O}_{0})S(Z^{O}_{2})&...&-S(Z^{O}_{0})S(Z^{O}_{N})\\-S(Z^{O}_{1})S(Z^{O}_{0})&S(Z^{O}_{1})(1-S(Z^{O}_{1}))&-S(Z^{O}_{1})S(Z^{O}_{2})&...&-S(Z^{O}_{1})S(Z^{O}_{N})\\-S(Z^{O}_{2})S(Z^{O}_{0})&-S(Z^{O}_{2})S(Z^{O}_{1})&S(Z^{O}_{2})(1-S(Z^{O}_{2}))&...&-S(Z^{O}_{2})S(Z^{O}_{N})\\...&...&...&&...\\-S(Z^{O}_{N})S(Z^{O}_{0})&-S(Z^{O}_{N})S(Z^{O}_{1})&-S(Z^{O}_{N})S(Z^{O}_{2})&...&S(Z^{O}_{N})(1-S(Z^{O}_{N}))\end{pmatrix} $$

For this I incorrectly called the df2 function with the z1 argument (the pre-activations $ Z^{O} $), when the Jacobian should instead be built from the activations $ A^{O} = Softmax(Z^{O}) $; for the 2-neuron case: $$ \begin{pmatrix}A^O_0(1-A^O_0)&-A^O_0A^O_1\\-A^O_0A^O_1&A^O_1(1-A^O_1)\end{pmatrix} $$ Using this (i.e. feeding the activations $ A^{O} $ into df2), I get: $$ \frac{\partial{A^{O}}}{\partial{Z^{O}}} = \begin{pmatrix}0.25&-0.25\\-0.25&0.25\end{pmatrix} $$

Finally, doing the multiplication: $$ \frac{\partial{E(A^{O}, T)}}{\partial{A^{O}}} * \frac{\partial{A^{O}}}{\partial{Z^{O}}} = \begin{pmatrix}0&-2\end{pmatrix}*\begin{pmatrix}0.25&-0.25\\-0.25&0.25\end{pmatrix} = \begin{pmatrix}0.5&-0.5\end{pmatrix} $$ Which matches the original: $$ \frac{\partial{E(A^{O}, T)}}{\partial{Z^{O}}} = A^{O} - T = \begin{pmatrix}0.5&0.5\end{pmatrix} - \begin{pmatrix}0&1\end{pmatrix} = \begin{pmatrix}0.5&-0.5\end{pmatrix} $$ And this is where I messed up.
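In code this just means feeding the activations into the same df2 function from the question; a quick sketch with my own variable names:

a = np.array([[0.5, 0.5]])       # A^O = Softmax(Z^O), as a single-row batch
t = np.array([0.0, 1.0])         # T
jacobian = df2(a)[0]             # [[ 0.25 -0.25]
                                 #  [-0.25  0.25]]
delta = (-t / a[0]) @ jacobian   # dE/dA^O multiplied by dA^O/dZ^O
print(delta)                     # [ 0.5 -0.5]  which equals A^O - T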

Licensed under: CC-BY-SA with attribution