I heard that neural networks still suffer from the vanishing gradient problem even when the ReLU activation function is used.

In ResNet (which uses skip connections to reduce the problem), I heard there is still a practical limit of roughly 120~190 layers.

To reach complete AI performance (or general AI with strong intelligence), I believe the limit on the number of layers must be overcome.

Is there any possibility that we could find a new activation function that does not limit the number of layers? (Maybe we could use exhaustive search... checking the training performance of networks with 200~500 layers.)


Solution

In recent years, the problem of vanishing/exploding gradients has stopped causing much trouble. It is still something to keep in mind, but the tools and tricks developed over the last 5-7 years have dispelled most of the worry.

Today, activations from the ReLU family, combined with batch normalization, dropout, and other techniques such as good parameter initialization, have made this problem much less scary.
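
For concreteness, here is a minimal sketch of that combination in tf.keras; the framework, depth, layer width, and dropout rate are my own illustrative choices, not something prescribed above.

```python
from tensorflow import keras

def make_deep_mlp(in_dim=784, width=256, depth=30, out_dim=10, p_drop=0.1):
    inputs = keras.Input(shape=(in_dim,))
    x = inputs
    for _ in range(depth):
        # He initialization keeps the signal scale roughly stable under ReLU.
        x = keras.layers.Dense(width, kernel_initializer="he_normal")(x)
        x = keras.layers.BatchNormalization()(x)
        x = keras.layers.Activation("relu")(x)
        x = keras.layers.Dropout(p_drop)(x)
    outputs = keras.layers.Dense(out_dim, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = make_deep_mlp()
model.summary()
```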

At this point, the number of hidden layers depends mainly on other factors:

  1. The computational power available, of course.

  2. The complexity of your dataset. If the signal is very simple and can be learned in a few epochs, too many parameters means none of them gets trained enough (the error is backpropagated across all of them, so each one receives only a tiny update). In Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, Aurélien Géron suggests that sometimes we don't need to worry too much about model size: just build a very powerful network and use early stopping to train it only as much as you need (see the sketch just below this list). That is a different way of tackling the problem.
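
Here is a hedged tf.keras sketch of that "build a powerful network and early-stop it" approach; the model size, patience value, and the random placeholder data are assumptions for illustration only.

```python
import numpy as np
from tensorflow import keras

# An intentionally oversized model for a small (random, placeholder) dataset.
inputs = keras.Input(shape=(20,))
x = inputs
for _ in range(10):
    x = keras.layers.Dense(256, activation="relu",
                           kernel_initializer="he_normal")(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Stop once the validation loss stops improving and roll back to the best
# weights seen, so the oversized network trains only as long as it helps.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)

x_train = np.random.rand(2000, 20).astype("float32")   # placeholder data
y_train = np.random.randint(0, 2, size=(2000, 1))
model.fit(x_train, y_train, validation_split=0.2, epochs=1000,
          callbacks=[early_stop], verbose=0)
```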

In light of that, coming to your observations:

To reach complete AI performance (or general AI with strong intelligence), I believe the limit on the number of layers must be overcome.

Strictly speaking, I think there is no right number of hidden layers. It is not a result you can derive from a mathematical formula. In many respects, deep learning is more an art than a science, and as I explained above, the problem can be tackled in more than one way.


Is there any possibility that we could find a new activation function that does not limit the number of layers?

The activation function is only one small piece of a larger mosaic. It is not up to the activation function alone to solve the issue. Research on new activation functions is still active, though, and very interesting to follow.

Additional tips

The maximum number of layers that a network can have depends on many factors, such as the model architecture, the activation function, the optimization method, and others. For example, as you alluded to, ResNets were a huge step in increasing the number of trainable layers, up to 1000 layers for CIFAR-10. Another example: when ReLU functions became popular, Kaiming (He) initialization allowed deeper networks to be trained by keeping the variance of each layer's activations at around 1, whereas the other popular method, Xavier initialization, is better suited to sigmoid activations.
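
A small NumPy experiment can illustrate that initialization point. The layer width, depth, and the simplified Xavier variant (variance 1/fan_in, which matches Glorot's formula when fan_in equals fan_out) are my own choices for the demo: with ReLU, He-style weights keep the mean squared activation near 1 across many layers, while Xavier-style weights let it shrink roughly by half per layer.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 512, 50, 1024

def mean_square_after(fan_in_scale):
    """Mean squared activation after `depth` ReLU layers whose weights are
    i.i.d. zero-mean Gaussian with variance fan_in_scale / fan_in."""
    x = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(fan_in_scale / width)
        x = np.maximum(0.0, x @ W)  # ReLU
    return float((x ** 2).mean())

print("He     (2 / fan_in):", mean_square_after(2.0))  # stays around 1
print("Xavier (1 / fan_in):", mean_square_after(1.0))  # shrinks ~2x per layer
```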

So in general, there's no known limit except to keep experimenting and coming up with new techniques, including but not limited to finding better activation functions.
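
To make the ResNet point above concrete, here is a hedged sketch of a skip-connection (residual) block using plain Dense layers in tf.keras; real ResNets use convolutional blocks, and the widths and depth here are arbitrary.

```python
from tensorflow import keras

def residual_block(x, units=256):
    """A fully connected stand-in for a ResNet block: two layers plus a
    skip connection that adds the block's input back to its output."""
    shortcut = x
    y = keras.layers.Dense(units, kernel_initializer="he_normal")(x)
    y = keras.layers.BatchNormalization()(y)
    y = keras.layers.Activation("relu")(y)
    y = keras.layers.Dense(units, kernel_initializer="he_normal")(y)
    y = keras.layers.BatchNormalization()(y)
    # The addition gives gradients a short path around the block, which is
    # what lets very deep stacks keep training.
    y = keras.layers.Add()([shortcut, y])
    return keras.layers.Activation("relu")(y)

inputs = keras.Input(shape=(256,))
x = inputs
for _ in range(50):  # depth chosen arbitrarily for the sketch
    x = residual_block(x)
outputs = keras.layers.Dense(10, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.summary()
```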

Licensed under: CC-BY-SA with attribution