Question

I have a simple neural network with one hidden layer and softmax as the activation function for the output layer. For the hidden layer I am trying out various activation functions, since I am implementing and testing as many of them as I can.

For training and testing I am currently using the MNIST dataset of handwritten digits, so my input data is a matrix in which each row is a different image and each column is a pixel of that image after it has been reshaped into a vector.

When I use a sigmoid activation function for both layers, the numerical and analytical gradients agree. But when I try something else, such as tanh or softplus for the hidden layer with softmax for the output layer, there are big differences, as can be seen in the data below (left: numerical gradient, right: analytical gradient).

(1) sigmoid (hidden layer), (2) softmax (output layer)

  -9.4049e-04  -6.4143e-04
  -6.2623e-05  -2.5895e-05
   1.0676e-03   6.9474e-04
  -2.0473e-03  -1.3471e-03
   2.9846e-03   1.9716e-03
   4.0945e-05   2.7627e-05
  -2.5102e-05  -1.7017e-05
   8.8054e-06   6.0967e-06
   7.8509e-06   5.0682e-06
  -2.4561e-05  -1.6270e-05
   5.6108e-05   3.8449e-05
   2.0690e-05   1.2590e-05
  -9.7665e-05  -6.3771e-05
   1.7235e-04   1.1345e-04
  -2.4335e-04  -1.6071e-04

(1) tanh (hidden layer), (2) softmax (output layer)

  -3.9826e-03  -2.7402e-03
   4.6667e-05   1.1115e-04
   3.9368e-03   2.5504e-03
  -7.7824e-03  -5.1228e-03
   1.1451e-02   7.5781e-03
   1.5897e-04   1.0734e-04
  -9.6886e-05  -6.5701e-05
   3.3560e-05   2.3153e-05
   3.3344e-05   2.1786e-05
  -1.0282e-04  -6.8409e-05
   2.1185e-04   1.4774e-04
   9.0293e-05   5.3752e-05
  -4.0012e-04  -2.6047e-04
   6.9648e-04   4.5839e-04
  -9.7518e-04  -6.4468e-04

(1) sigmoid (hidden layer), (2) sigmoid (output layer)

  -9.2783e-03  -9.2783e-03
   8.8991e-03   8.8991e-03
  -8.3601e-03  -8.3601e-03
   7.6281e-03   7.6281e-03
  -6.7480e-03  -6.7480e-03
  -3.0498e-06  -3.0498e-06
   1.4287e-05   1.4287e-05
  -2.5938e-05  -2.5938e-05
   3.6988e-05   3.6988e-05
  -4.6876e-05  -4.6876e-05
  -1.7506e-04  -1.7506e-04
   2.3315e-04   2.3315e-04
  -2.8747e-04  -2.8747e-04
   3.3532e-04   3.3532e-04
  -3.7622e-04  -3.7622e-04
  -9.6266e-05  -9.6266e-05
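
The numerical gradients in the left column come from a finite-difference check; a central-difference version of such a check looks roughly like the sketch below (costFunc is a placeholder for a handle that returns the cost J for an unrolled parameter vector):

% Central-difference gradient check (sketch).
% costFunc: handle mapping an unrolled parameter vector to the cost J.
function numgrad = computeNumericalGradient(costFunc, theta)
  numgrad = zeros(size(theta));
  perturb = zeros(size(theta));
  e = 1e-4;                                % perturbation step
  for i = 1:numel(theta)
    perturb(i) = e;
    loss1 = costFunc(theta - perturb);
    loss2 = costFunc(theta + perturb);
    numgrad(i) = (loss2 - loss1) / (2*e);  % central difference
    perturb(i) = 0;
  end
end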

The way I implement backpropagation is as follows:

Variables:

Theta1, Theta2: matrices with the weights for the first and second layers

m: size of my training set

y: a vector with the correct category for every input sample

Y: a matrix with the one hot encoding for the category for every input sample

X: a matrix with input data, each row is a different training sample
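
(For reference, Y can be built from y with the usual Octave one-hot trick, e.g. Y = eye(num_labels)(y, :), where num_labels stands for the number of classes; the exact variable name is assumed here.)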

% Feedforward
a1 = [ones(m, 1) X];
z2 = a1*Theta1';
a2 = [ones(m, 1) activation(z2, activation_type)];
z3 = a2*Theta2';
a3 = activation(z3, 'softmax');
h = a3;

% Calculate J
J = sum(sum((-Y).*log(h) - (1-Y).*log(1-h), 2))/m + lambda*p/(2*m); % sigmoid
%J = -(sum(sum((Y).*log(h))) + lambda*p/(2*m)); % softmax

% Calculate sigmas
sigma3 = a3 - Y;
sigma2 = (sigma3*Theta2).*activationGradient([ones(m, 1) z2], 'sigmoid');
sigma2 = sigma2(:, 2:end);

% Accumulate gradients
delta_1 = (sigma2'*a1);
delta_2 = (sigma3'*a2);
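
For reference, a row-wise, numerically stable softmax of the kind activation(z, 'softmax') is expected to compute looks roughly like this (a sketch, not necessarily the exact implementation):

% Row-wise softmax (sketch of what activation(z, 'softmax') is assumed to do).
function a = softmaxActivation(z)
  z = z - max(z, [], 2);       % subtract each row's max to avoid overflow in exp
  ez = exp(z);
  a = ez ./ sum(ez, 2);        % normalise so that every row sums to 1
end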

The first cost calculation J is for sigmoid and the commented-out one below it is for softmax (see the comments); I switch between the two depending on the output activation.

Have I missed something during backpropagation? Why does it work as expected with sigmoid but not with softmax?

Solution

I think it might be a relatively trivial bug in your cost function for softmax:

J = -(sum(sum((Y).*log(h))) + lambda*p/(2*m)) 

should be

J = -sum(sum((Y).*log(h)))/m + lambda*p/(2*m) 

That is, for softmax only, you have effectively subtracted the regularisation term from the cost function instead of adding it. You have also forgotten to divide the error term by the number of examples in the batch, even though you do take that average when calculating the gradients.
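
Concretely, keeping the variable names from the question (and assuming p is the sum of squared non-bias weights, as in the sigmoid case), the softmax branch would read:

% Softmax cross-entropy: average the error term over the m examples, then
% ADD the L2 regularisation term. The gradients are already divided by m
% elsewhere, so only J needs fixing.
J = -sum(sum((Y).*log(h)))/m + lambda*p/(2*m);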

Your backpropagation calculations look correct to me once this miscalculation of J is fixed.
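
If it helps, a quick way to confirm the fix is to compare the two gradient vectors by their relative difference, which should drop to roughly 1e-9 or smaller once the cost function and the backpropagation agree (numgrad and grad are assumed names for the unrolled numerical and analytical gradients):

% Relative difference between numerical and analytical gradients.
diff = norm(numgrad - grad) / norm(numgrad + grad);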

Licensed under: CC-BY-SA with attribution