Question

Why do we generally use activation functions with only a limited range in neural networks? For example:

  • the $sigmoid$ activation function has range $(0, 1)$
  • the $tanh$ activation function has range $(-1, 1)$

Q1) Suppose I use some other non-linear activation function like $f(x) = x^2$ that doesn't have such a limited range. What are the potential problems in training such a neural network?

Q2) Suppose I use $f(x) = x^2$ as the activation function in a neural network and I normalize the layers (to keep values from being multiplied into ever larger values). Would such a neural network work? (This again refers to the question in the heading: "Why do we generally use activation functions with only a limited range in neural networks?")


Solution

The main goal of an activation function is to add non-linearity, but at the same time:

  • It must not blow up for large inputs, otherwise the time to reach the minimum becomes very long
  • It should have a smooth gradient, which keeps training moving toward the minimum without getting stuck

With your example function -

We will still get non-linearity, but the gradient becomes very large for large inputs, which will slow down training.
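
A minimal sketch of this, assuming NumPy, comparing the gradient of $f(x) = x^2$ with the gradients of sigmoid and tanh as the pre-activation grows:

```python
import numpy as np

x = np.array([0.5, 2.0, 10.0, 100.0])   # pre-activations of growing magnitude

# Gradient of f(x) = x^2 grows without bound
grad_square = 2 * x

# Gradients of the bounded activations stay small
sig = 1 / (1 + np.exp(-x))
grad_sigmoid = sig * (1 - sig)           # never exceeds 0.25
grad_tanh = 1 - np.tanh(x) ** 2          # never exceeds 1

print("x^2     :", grad_square)          # [  1.   4.  20. 200.]
print("sigmoid :", grad_sigmoid)
print("tanh    :", grad_tanh)
```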

I believe you guessed this, which is why you asked the next question: what if I normalize the data?

Normalization helps bring two different features onto the same scale, but within a single feature the larger values will still be larger than the smaller ones.
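
As an illustrative sketch (assuming NumPy, with made-up values), standardizing a feature rescales it, but the largest values in the feature remain the largest, so squaring them still produces the largest activations and gradients:

```python
import numpy as np

feature = np.array([1.0, 2.0, 3.0, 50.0])            # one feature with an outlier
normalized = (feature - feature.mean()) / feature.std()

print(normalized)        # the outlier is still far from the other values
print(2 * normalized)    # gradient of x^2 at these values: still largest for the outlier
```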

Bounded functions such as sigmoid and tanh, however, stop learning at both extremes because the gradient flattens out there, and they also require relatively heavy computation (exponentials). ReLU, which has no upper limit, is a good trade-off on all the points mentioned above.
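
A small sketch (assuming NumPy) of the saturation referred to above: at extreme inputs the sigmoid and tanh gradients collapse toward zero, while ReLU keeps a constant gradient of 1 for any positive input:

```python
import numpy as np

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sig = 1 / (1 + np.exp(-x))
grad_sigmoid = sig * (1 - sig)       # ~0 at both extremes -> learning stalls
grad_tanh = 1 - np.tanh(x) ** 2      # same saturation at the extremes
grad_relu = (x > 0).astype(float)    # 1 for any positive input, however large

print(grad_sigmoid)
print(grad_tanh)
print(grad_relu)
```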

Other tips

As far as I know, the reason the first activations were chosen this way can be traced back to the electrical properties of a neuron. A neuron needs a minimum electrical potential to transmit an incoming signal. This can be expressed mathematically by a step function, in which any value above the threshold maps to one and anything below it maps to zero. All we need to know is whether the potential is smaller or larger than a given threshold value, i.e. whether the switch is on or off, so there is no need to know "how much" it is on or off.

The sigmoid function is, in a sense, a differentiable version of the step function, and tanh has the same shape, just rescaled and shifted to span $(-1, 1)$ (in fact $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$). But modern activations are not bounded: ReLU and its variants Leaky ReLU and ELU have no upper limit.
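
A quick numerical check of that relationship between sigmoid and tanh (assuming NumPy):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))   # True
```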

But they all share a common property: they are monotonic (non-decreasing), i.e. they either increase or stay constant as the input increases. So I suppose you will be OK as long as the function you choose shares this property (how fast it will converge is another question entirely). Your example of $f(x) = x^2$ does not have this property: for example, $-0.5$ and $0.5$ map to the same value, which I suppose will cause problems. This goes against the idea of how a neuron works, but of course we are not bound by biology when constructing neural networks.
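
A tiny sketch of the non-monotonicity point (assuming NumPy): $f(x) = x^2$ maps inputs of opposite sign to the same activation, so the sign of the pre-activation is lost, whereas a monotone activation keeps distinct inputs distinct:

```python
import numpy as np

square = lambda x: x ** 2

# Opposite-sign inputs collapse to the same activation: sign information is lost
print(square(-0.5), square(0.5))      # 0.25 0.25
# A monotone activation like tanh preserves the distinction
print(np.tanh(-0.5), np.tanh(0.5))    # -0.4621... 0.4621...
```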

Licensed under: CC-BY-SA with attribution