Question

When I started with artificial neural networks (NNs) I thought I would have to fight overfitting as the main problem. But in practice I can't even get my NN past a 20% error rate, let alone beat my random forest score!

I'm seeking some advice, general or otherwise, on what one should do to make a NN start capturing trends in the data.

For the NN implementation I use the Theano Stacked Auto Encoder with the code from the tutorial, which works great (less than 5% error rate) for classifying the MNIST dataset. It is a multilayer perceptron with a softmax layer on top, and each hidden layer is pre-trained as an autoencoder (fully described in the tutorial, chapter 8). There are ~50 input features and ~10 output classes. The NN has sigmoid neurons and all data are normalized to [0, 1]. I have tried lots of different configurations: the number of hidden layers and the neurons in them (100->100->100, 60->60->60, 60->30->15, etc.), different learning and pre-training rates, etc.
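For reference, the kind of setup described above (~50 features scaled to [0, 1], a few sigmoid hidden layers, softmax output) can be sketched in plain NumPy. The layer sizes and the random data here are illustrative stand-ins, not the actual dataset or the tutorial's Theano code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 50))                                 # 8 samples, ~50 features
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))    # normalize to [0, 1]

sizes = [50, 60, 60, 60, 10]            # e.g. the 60->60->60 configuration
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

h = X
for W, b in zip(weights[:-1], biases[:-1]):
    h = sigmoid(h @ W + b)                          # sigmoid hidden layers
probs = softmax(h @ weights[-1] + biases[-1])       # softmax layer on top
print(probs.shape)                                  # (8, 10)
```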

And the best thing I can get is a 20% error rate on the validation set and a 40% error rate on the test set.

On the other hand, when I try to use Random Forest (from scikit-learn) I easily get a 12% error rate on the validation set and 25%(!) on the test set.
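For comparison, the random-forest baseline mentioned above takes only a few lines in scikit-learn. The data here is a synthetic stand-in (the actual dataset isn't shown), so the numbers it produces are not the question's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in: ~50 features, ~10 classes with a learnable signal
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
y = X[:, :10].argmax(axis=1)       # class = index of largest of first 10 features

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
error_rate = 1.0 - clf.score(X_val, y_val)
print(f"validation error rate: {error_rate:.2%}")
```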

How can it be that my deep NN with pre-training behaves so badly? What should I try?


Solution

The problem with deep networks is that they have lots of hyperparameters to tune and a very small solution space. Thus, finding good ones is more of an art than an engineering task. I would start with the working example from the tutorial and play around with its parameters to see how the results change; this builds good intuition (though not a formal explanation) about the dependencies between parameters and results, both final and intermediate.

I also found the following papers very useful:

They both describe RBMs, but contain some insights on deep networks in general. For example, one of the key points is that networks need to be debugged layer-wise: if a layer doesn't provide a good representation of the features, further layers have almost no chance to fix it.

OTHER TIPS

While ffriend's answer gives some excellent pointers for learning more about how neural networks can be (extremely) difficult to tune properly, I thought it might be helpful to list a couple of specific techniques currently used in top-performing classification architectures in the neural network literature.

Rectified linear activations

The first thing that might help in your case is to switch your model's activation function from the logistic sigmoid -- $f(z) = \left(1 + e^{-z}\right)^{-1}$ -- to a rectified linear (aka relu) -- $f(z) = \max(0, z)$.

The relu activation has two big advantages:

  • its output is a true zero (not just a small value close to zero) for $z \le 0$ and
  • its derivative is constant, either 0 for $z \le 0$ or 1 for $z > 0$.

A network of relu units basically acts like an ensemble of exponentially many linear networks, because units that receive input $z \le 0$ are essentially "off" (their output is 0), while units that receive input $z > 0$ collapse into a single linear model for that input. The constant derivatives also matter: a deep network with relu activations largely avoids the vanishing gradient problem and can be trained without layerwise pretraining.
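Both properties can be illustrated in a few lines of NumPy (this is generic illustration code, not tied to the question's model):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)          # true zero output for z <= 0

def relu_grad(z):
    return (z > 0).astype(float)       # constant derivative: 0 or 1

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)               # at most 0.25, so it shrinks gradients

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))          # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))     # [0. 0. 0. 1. 1.]

# chaining 10 sigmoid layers multiplies the gradient by <= 0.25 per layer:
print(sigmoid_grad(0.0) ** 10)   # ~9.5e-07 -- the vanishing-gradient effect
```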

See "Deep Sparse Rectifier Neural Networks" by Glorot, Bordes, & Bengio (http://jmlr.csail.mit.edu/proceedings/papers/v15/glorot11a/glorot11a.pdf) for a good paper about these topics.

Dropout

Many research groups in the past few years have been advocating for the use of "dropout" in classifier networks to avoid overtraining. (See for example "Dropout: A simple way to prevent neural networks from overfitting" by Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov http://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf) In dropout, during training, some constant proportion of the units in a given layer are randomly set to 0 for each input that the network processes. This forces the units that aren't set to 0 to "make up" for the "missing" units. Dropout seems to be an extremely effective regularizer for neural network models in classification tasks. See a blog article about this at http://fastml.com/regularizing-neural-networks-with-dropout-and-with-dropconnect/.
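The training-time mechanics can be sketched as "inverted" dropout, a common variant in which the kept activations are rescaled at training time so nothing needs to change at test time (an illustration in NumPy, not the question's Theano code):

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Zero a random fraction p_drop of units; scale survivors up by
    1/(1 - p_drop) so the expected activation is unchanged (inverted dropout)."""
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = np.ones((4, 10))                         # pretend hidden-layer activations
h_train = dropout(h, p_drop=0.5, rng=rng)    # training: ~half the units set to 0
h_test = h                                   # test time: all units, no rescaling
print(h_train)                               # a mix of 0.0s and 2.0s
```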

You might be interested in reading the following paper by researchers of Microsoft Research:

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: "Deep Residual Learning for Image Recognition", arXiv, 2015.

They ran into a problem similar to yours:

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments.

To solve the problem, they made use of a skip (residual) architecture. With it they trained very deep networks (up to 152 layers on ImageNet, and over 1000 layers on CIFAR-10) and won the ILSVRC 2015 classification challenge.
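The core idea of the skip architecture is that each block learns only a correction $F(x)$ that is added back to its input, $y = x + F(x)$, so representing the identity (i.e. doing no harm with extra depth) becomes trivial. A minimal NumPy sketch, with illustrative layer shapes:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2):
    """y = x + F(x): the block only has to learn the residual F."""
    return x + relu(x @ W1) @ W2       # the skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))

# With zero weights the block is exactly the identity, so stacking more
# blocks cannot raise the training error just by being present.
W1 = np.zeros((16, 16))
W2 = np.zeros((16, 16))
print(np.allclose(residual_block(x, W1, W2), x))   # True
```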

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange