Does the choice of activation function and initial weights have any bearing on whether a Neural Network gets stuck in a local minimum?

StackOverflow https://stackoverflow.com/questions/8057296

Question

I posted this question yesterday asking if my Neural Network (which I'm training via backpropagation using stochastic gradient descent) was getting stuck in a local minimum. The following papers discuss the problem of local minima in an XOR neural network. The first one says that there is no local-minimum problem, whereas the next paper (written a year later) says that there is a local-minimum problem in a 2-3-1 XOR neural network (as an aside, I'm using a 3-3-1 network, i.e., with bias on the input and hidden layers). Both of these are abstracts (I don't have access to the full papers, so I'm unable to read them):

There is also another paper [PDF] that says there is no local minimum for the simplest XOR network, but it doesn't seem to be talking about a 2-3-1 network.

Now onto my actual question: I couldn't find anything that discusses the choice of the activation function, the choice of initial weights, and what impact these have on whether the neural network will get stuck in a local minimum. The reason I'm asking is that in my code I have tried using both the standard sigmoid activation function and the hyperbolic tangent activation function. I noticed that with the former I get stuck only around 20% of the time, whereas with the latter I tend to get stuck far more often. I'm also randomizing my weights whenever I first initialize the network, so I'm wondering if a certain set of random weights is more disposed to making my neural network get "stuck".
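For reference, here is a minimal Python/NumPy sketch (illustrative only, not my actual training code) of the two activations I'm comparing and a seeded random initialization; the weight shapes assume the "3-3-1" counts the bias as an extra unit on the input and hidden layers:

```python
import numpy as np

def sigmoid(x):
    # Standard logistic sigmoid: output range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent: output range (-1, 1).
    return np.tanh(x)

def init_weights(seed, scale=0.5):
    """Random weights for a 3-3-1 net (2 inputs + bias -> 3 hidden -> 1 output).

    Recording the seed makes a run that gets "stuck" reproducible later.
    """
    rng = np.random.default_rng(seed)
    w_hidden = rng.uniform(-scale, scale, size=(3, 3))  # input layer (incl. bias) -> hidden
    w_output = rng.uniform(-scale, scale, size=(4, 1))  # hidden layer (incl. bias) -> output
    return w_hidden, w_output

w_hidden, w_output = init_weights(seed=42)
```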

As far as the activation function is concerned, since the error is eventually related to the output produced by the activation function, I'm thinking that there is an effect (i.e., the error surface changes). However, this is simply based on intuition and I'd prefer a concrete answer (for both points: initial weights and choice of the activation function).


Solution

The random weights given to a Neural Network often immediately restrict the portion of the search space that will be available during learning. This is particularly true when learning rates are small.
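As a rough back-of-the-envelope illustration (the numbers below are assumed, not measured): with a small learning rate each weight only drifts a little per epoch, so the region of weight space that gets explored early on is largely fixed by the initialization.

```python
# Per update, a weight moves by at most learning_rate * |gradient|.
learning_rate = 0.01   # assumed small learning rate
typical_grad = 0.05    # assumed: saturating sigmoid/tanh units keep gradients small
updates_per_epoch = 4  # one stochastic update per XOR pattern
drift_per_epoch = learning_rate * typical_grad * updates_per_epoch
print(drift_per_epoch)  # 0.002 per epoch -- early on, the search stays close to the random start
```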

However, in the XOR case (using a 3-3-1 topology) there should not be any local minima.

My recommendation, since the network is so tiny, is to print the edge weights whenever it seems stuck in a local minimum. You should be able to quickly evaluate whether or not the weights appear to be correct and how far away the values are from giving you a perfect network.
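For a network this small, a dump like the following is usually enough to eyeball what went wrong (a sketch assuming NumPy weight matrices; adapt it to however your weights are stored):

```python
import numpy as np

def dump_weights(w_hidden, w_output, epoch, error):
    # Print every edge weight of the tiny 3-3-1 net so a "stuck" state can be inspected by hand.
    np.set_printoptions(precision=4, suppress=True)
    print(f"epoch {epoch}, total error {error:.6f}")
    print("input->hidden weights:\n", w_hidden)
    print("hidden->output weights:\n", w_output)

# Call this whenever the error has stopped decreasing for a while.
```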

One trick that made a large difference for me: instead of updating the weights immediately after each piece of training data, batch the errors up and update the weights at the end of an epoch. That prevented my network from being swayed early on when the first half of my input data belonged to the same classification bucket.
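A sketch of the idea (my own minimal reconstruction in Python/NumPy, not the original code) for a 2-3-1 sigmoid net with separate bias vectors; the only difference between the two modes is when the gradients are applied:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR training set.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

def grads(W1, b1, W2, b2, x, y):
    # Forward pass through the 2 -> 3 -> 1 net, then backprop of the squared error.
    h = sigmoid(x @ W1 + b1)                 # hidden activations, shape (3,)
    o = sigmoid(h @ W2 + b2)                 # output, shape (1,)
    delta_o = (o - y) * o * (1 - o)          # output-layer error term
    delta_h = (W2 @ delta_o) * h * (1 - h)   # hidden-layer error term
    return np.outer(x, delta_h), delta_h, np.outer(h, delta_o), delta_o

def train(batch_updates, epochs=5000, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (2, 3)); b1 = rng.uniform(-0.5, 0.5, 3)
    W2 = rng.uniform(-0.5, 0.5, (3, 1)); b2 = rng.uniform(-0.5, 0.5, 1)
    for _ in range(epochs):
        gW1 = np.zeros_like(W1); gb1 = np.zeros_like(b1)
        gW2 = np.zeros_like(W2); gb2 = np.zeros_like(b2)
        for x, y in zip(X, Y):
            dW1, db1, dW2, db2 = grads(W1, b1, W2, b2, x, y)
            if batch_updates:
                # Accumulate; apply once at the end of the epoch.
                gW1 += dW1; gb1 += db1; gW2 += dW2; gb2 += db2
            else:
                # Stochastic/online: apply immediately after each example.
                W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2
        if batch_updates:
            W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)

print(train(batch_updates=True).round(3))   # weights updated once per epoch
print(train(batch_updates=False).round(3))  # weights updated per example
```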

Which brings me to my next point: are you sure you have an evenly distributed set of training examples? If you provide a neural network with 900 positive examples but only 100 negative ones, the network sometimes decides it's just easier to say everything is in the positive class, because doing so gives it only a 10% error rate. Many learning algorithms are extremely good at finding these kinds of shortcuts.
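With the 900/100 split above, the "predict the majority class for everything" shortcut already scores 90%:

```python
# 900 positive and 100 negative examples; a classifier that always answers
# "positive" is right 900 times out of 1000.
positives, negatives = 900, 100
always_positive_accuracy = positives / (positives + negatives)
print(always_positive_accuracy)  # 0.9 -- only a 10% error rate without learning anything
```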

Lastly, the activation function should make little to no difference in whether or not training hits a local minimum. The activation function serves primarily as a way to project the domain of the reals onto a much smaller known range: (0, 1) for the sigmoid and (-1, 1) for the hyperbolic tangent. You can think of this as a way of enforcing equality across all of your learned features at a given layer (a.k.a. feature scaling). Since the input domain is not known beforehand, it's not as simple as regular feature scaling for linear regression, and thus activation functions must be used; but it is otherwise compensated for when computing errors during backpropagation.
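To make the "much smaller known range" point concrete, a quick check of the two ranges and of the fact that tanh is just a rescaled, recentred sigmoid (tanh(x) = 2·sigmoid(2x) − 1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(sigmoid(x).min(), sigmoid(x).max())   # stays inside (0, 1)
print(np.tanh(x).min(), np.tanh(x).max())   # stays inside (-1, 1)
# tanh is a shifted/scaled sigmoid: tanh(x) == 2*sigmoid(2*x) - 1
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True
```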

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow