Question

I have a set of questions related to the usage of various activation functions used in neural networks. I would highly appreciate it if someone could give explanatory answers.

  1. Why is ReLU used only on hidden layers specifically?

  2. Why is Sigmoid not used in multi-class classification?

  3. Why do we not use any activation function in regression problems having all negative values?

  4. Why do we use average='micro' while calculating the performance metric in multi-class classification?

    • f1_score(y_test, y_pred, average='micro')

Solution

I'll go through your questions one by one.


1. Why is ReLU used only on hidden layers specifically?

It's not necessarily used on hidden layers only. ReLU works much better than "older" activation functions (such as Sigmoid and Tanh) because it backpropagates the error much better than its counterparts: its gradient is exactly 1 for every positive input, so it doesn't saturate the way Sigmoid and Tanh do. All of the most powerful activation functions are ReLU or some variant of it.
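You can see the gradient difference with a tiny numpy sketch (the input values are made up for illustration):

    import numpy as np

    x = np.array([-4.0, -1.0, 1.0, 4.0])

    sigmoid = 1 / (1 + np.exp(-x))
    sigmoid_grad = sigmoid * (1 - sigmoid)   # capped at 0.25, near 0 for large |x|
    relu_grad = (x > 0).astype(float)        # exactly 1 for every positive input

    print(sigmoid_grad)  # [0.0177 0.1966 0.1966 0.0177] -> shrinks the error signal
    print(relu_grad)     # [0. 0. 1. 1.]                 -> passes it through intact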

The typical ML tasks (classification and regression) do not require ReLU activation at the output layer, because of the nature of the task itself. However, regression sometimes does in fact call for a ReLU. Let me give an example: I once trained an RNN to predict pollution levels based on data from the last 24 hours. Since pollution levels cannot, by definition, go below zero, I used ReLU as the output activation for my regressor. In this way, you force your model to never predict below zero. House price prediction is another example in which you can use a ReLU at the output layer.
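Here is a minimal sketch of that setup, assuming TensorFlow/Keras; the layer sizes and the random data are placeholders, and I've used a plain feed-forward net instead of the RNN just to keep it short:

    import numpy as np
    from tensorflow import keras

    # 24 hourly readings in, one non-negative prediction out
    model = keras.Sequential([
        keras.Input(shape=(24,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="relu"),  # output can never go below zero
    ])
    model.compile(optimizer="adam", loss="mse")

    X = np.random.rand(100, 24)           # placeholder features
    y = np.abs(np.random.randn(100, 1))   # placeholder non-negative targets
    model.fit(X, y, epochs=2, verbose=0)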


2. Why is Sigmoid not used in multi-class classification?

You need Softmax, because it returns a vector in which the activations of the final nodes sum to one. In this way, you can interpret each coefficient as "the probability of belonging to class X". A typical Softmax output is something like:

[ 0.3 , 0.5 , 0.2 ]

If you use Sigmoid instead, all nodes are independent of each other. You could get results like:

[ 0.9 , 0.9 , 0.9 ], or: [ 0.1 , 0.1 , 0.1 ]

which doesn't make much sense for a classifier and cannot be interpreted as a vector of probabilities.

(You can use Sigmoid in case of binary classification with a single output node; that's the only case in which it works.)
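You can see the contrast with a few lines of numpy (the logits are made up):

    import numpy as np

    logits = np.array([1.0, 1.5, 0.5])

    softmax = np.exp(logits) / np.exp(logits).sum()
    sigmoid = 1 / (1 + np.exp(-logits))

    print(softmax, softmax.sum())  # ~[0.31 0.51 0.19], sums to 1.0 -> probabilities
    print(sigmoid, sigmoid.sum())  # ~[0.73 0.82 0.62], each node independent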


3. Why do we not use any activation function in regression problems having all negative values?

Because, I think, there's no need for one: a linear (identity) output can already produce any negative value, so I don't see any practical usefulness coming from adding an activation there.


4. Why do we use average='micro' while calculating a performance metric in multi-class classification? E.g. f1_score(y_test, y_pred, average='micro')

Unfortunately, there is no rule of thumb here. You should try different metrics and pick the one that best fits your task.
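If you want to compare the averaging modes side by side, a quick scikit-learn sketch (with toy labels I made up) looks like this:

    from sklearn.metrics import f1_score

    y_test = [0, 0, 0, 0, 1, 2]
    y_pred = [0, 0, 0, 1, 1, 2]

    print(f1_score(y_test, y_pred, average='micro'))  # ~0.833: pools all counts
    print(f1_score(y_test, y_pred, average='macro'))  # ~0.841: unweighted class mean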

Other tips

Let me try an explanation for at least questions 1, 2 and 4:

  1. Usually, when talking about activation functions in your output layer, you want to achieve one of two things: either a binary output (0 or 1) for classification tasks (sometimes a soft output for probabilities, like Softmax), or a linear output for regression tasks. While ReLU is used "like" a linear function, it isn't really one: it returns 0 for every input from 0 down to -inf. Since your target is hardly ever something that is 0 for all negative values and linear for positive values, ReLU doesn't really make sense as an output activation.

  2. Sigmoid is not used in multi-class classification because it just gives you any value between 0 and 1 per node. Usually you want something that either tells you a binary 0 or 1 per class, or gives you real probabilities per class, like Softmax does.

  3. I'm not totally clear on what you're asking.

  4. Micro averages are computed by pooling the contributions of all classes and calculating the average over that aggregate. This is preferable if you suspect an imbalance between your classes, since it aggregates all class contributions rather than looking at each class individually (that, by the way, would be 'macro'); see the sketch below.
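Here is a small hand-rolled sketch of that difference, using precision for simplicity; the per-class counts are made up:

    # Per-class true positives and false positives (invented counts;
    # class 0 is much larger than the other two)
    tp = [90, 5, 5]
    fp = [10, 2, 3]

    # Micro: pool the counts of all classes, then compute one global precision
    micro_precision = sum(tp) / (sum(tp) + sum(fp))

    # Macro: compute precision per class, then take the unweighted mean
    macro_precision = sum(t / (t + f) for t, f in zip(tp, fp)) / len(tp)

    print(micro_precision)  # 100/115 ~ 0.870, dominated by the big class
    print(macro_precision)  # (0.900 + 0.714 + 0.625) / 3 ~ 0.746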

Licensed under: CC-BY-SA with attribution