Question

I am just reading about perceptrons in more depth, and now onto Sigmoid Neurons.

Some quotes:

A small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1..... That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn. We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not 0 or 1. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function and is defined by: $\sigma(z) \equiv \frac{1}{1+e^{-z}}$.

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be 1 or 0 depending on whether $w \cdot x + b$ was positive or negative. By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \text{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \text{output}$ is well approximated by:

$$\Delta \text{output} \approx \sum_j \frac{\partial\, \text{output}}{\partial w_j}\,\Delta w_j + \frac{\partial\, \text{output}}{\partial b}\,\Delta b$$

Don't panic if you're not comfortable with partial derivatives!

$\Delta \text{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.

In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book. [END]

The first part of my question is, how did they know to pick this "sigmoid shaped" function/equation in the first place? How did they know to pick this one over every other curved or non-curved function? Is that just standard practice for these types of problems in math class? If I were to try to explain why the sigmoid function was chosen, I would say "because it means you can make small changes to the input correspond to small changes to the output." But how? I don't follow the partial derivative math and don't have a background in partial derivatives (and neither does my audience). Knowing why and how the sigma function was chosen would help demystify why neural networks work.

Unfortunately the partial derivatives weren't explained (maybe they will be somewhere else).

The second part of my question is, how is $\Delta \text{output}$ a "linear function"? Why not just use a flat slope instead of the sigmoid shape? Why does it have to be so fancy? How does using $\sigma$ "simplify the algebra"? Where can I find research papers on the original thinking behind this, or if you know the answer, how can you explain why using sigma simplifies the algebra? This seems like an important part of the explanation of why we are using sigma functions in the first place, so having a layman's explanation would really help.


Solution

Answer to first part

The function in the question is called the logistic function. Sometimes it is also called "the" sigmoid function, but some authors use sigmoid to just mean any s-shaped function.

There are a wide variety of activation functions used in practice in neural networks, sigmoid and otherwise. The logistic function is one of the more common ones, because both the logistic function and its derivative are defined for all real inputs, can be written as short expressions in terms of elementary functions, and can be computed efficiently using standard library functions in most programming languages. (This is unlike the step function used as the activation function for a classic perceptron: the derivative of the step function is undefined at the discontinuity.) Another widely used activation function that has these properties is $\tanh$. There is really no strong reason to prefer one over the other when initially presenting sigmoid neurons. You can pretty much pick any function that you learn how to differentiate in a Calculus 1 class and that has a sigmoid shape with horizontal asymptotes as the input goes to $\pm\infty$. The different choices have slightly different performance characteristics in training, but that is not very relevant for an initial explanation.
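To make that concrete, here is a small Python sketch (my own, not from the book; the weight, bias, and input values are made up purely for illustration) that nudges the weight of a single neuron and compares a step activation with the logistic and tanh activations:

```python
import math

def step(z):
    return 1.0 if z > 0 else 0.0

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Example weight, bias and input chosen so that w*x + b sits exactly at 0,
# the point where the step function is most sensitive.
w, b, x = 2.0, -1.0, 0.5

for dw in (0.0, 0.001):                # tiny nudge to the weight
    z = (w + dw) * x + b
    print(f"dw={dw}: step={step(z)}, "
          f"logistic={logistic(z):.6f}, tanh={math.tanh(z):.6f}")
# The step output flips from 0 to 1, while the two smooth activations
# move only slightly -- the property the quoted passage is describing.
```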

It is unfortunately very difficult to explain backpropagation without understanding partial derivatives, since backpropagation is literally just gradient descent where the gradient is computed by automatic differentiation. I would recommend watching 3Blue1Brown's excellent series of YouTube videos on how backpropagation works: part 1, part 2, and especially part 3 and part 4.
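That said, the mechanics of gradient descent can be shown in a few lines for a single sigmoid neuron, because its two partial derivatives can be written out by hand. This is only a toy sketch of mine (made-up input, target, and learning rate), not the book's code:

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = 1.0, 1.0          # one training example: input 1.0, desired output 1.0
w, b = 0.0, 0.0               # arbitrary starting weight and bias
lr = 1.0                      # learning rate, chosen only for illustration

for i in range(5):
    out = sigma(w * x + b)
    # Partial derivatives of the squared error 0.5*(out - target)**2,
    # written out by hand for this single neuron:
    grad_z = (out - target) * out * (1.0 - out)
    w -= lr * grad_z * x      # d(error)/dw = grad_z * x
    b -= lr * grad_z          # d(error)/db = grad_z
    print(f"iteration {i}: output = {out:.4f}")
# Each small update to w and b nudges the output a little closer to the target.
```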

You mention an audience in the question. If you are going to be presenting this material, I would seriously consider referring your audience to the 4 videos linked above, at least as references. They certainly provide a better explanation than I could give in a lecture.

Answer to second part

The reason for not using a linear function is that a neural network in which every activation function is linear just takes linear combinations of linear combinations, and is therefore itself a linear function of its inputs. So using a linear activation function misses the entire point of training a neural network; you could get the same result faster by doing a least-squares fit of a linear function to the data.

To oversimplify only slightly: a neural network with a linear activation function is just the "fit trendline" feature in Excel.
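Here is a quick numerical check of that collapse (my own sketch, using arbitrary random weights): two stacked layers with identity activations compute exactly the same function as one suitably chosen linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with identity (linear) activation and arbitrary random weights.
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

x = rng.normal(size=2)

# Passing the input through both layers...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...gives exactly the same result as one combined linear layer.
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

print(np.allclose(two_layers, one_layer))   # True: stacking adds no expressive power
```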

By contrast, there is a universal approximation theorem that says that, for sufficiently nice nonlinear activation functions, any continuous function (on a bounded region of inputs) can be approximated arbitrarily well by using enough neurons.
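The intuition behind the theorem can be illustrated without the proof: with a steep enough logistic function, the difference of two shifted sigmoids is approximately a "bump" on an interval, and a weighted sum of such bumps can approximate a function. The snippet below is my own illustration (the steepness and number of pieces are arbitrary), not the theorem itself:

```python
import math

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def bump(x, a, b, steepness=50.0):
    """Roughly 1 on the interval (a, b) and roughly 0 elsewhere,
    built from just two sigmoid 'neurons'."""
    return sigma(steepness * (x - a)) - sigma(steepness * (x - b))

def approx_square(x, pieces=20):
    """Approximate f(x) = x**2 on [0, 1] by a weighted sum of bumps."""
    total = 0.0
    for i in range(pieces):
        a, b = i / pieces, (i + 1) / pieces
        mid = (a + b) / 2
        total += (mid ** 2) * bump(x, a, b)   # weight = value of f on that piece
    return total

for x in (0.1, 0.5, 0.9):
    print(f"x = {x}: true value {x**2:.3f}, sigmoid approximation {approx_square(x):.3f}")
```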

The universal approximation theorem was not proved until many years after neural networks were first invented, so it was not a motivating factor in their invention. Early neural network research was mainly inspired by biological neurons (the kind in your brain) and by control theory.

While the universal approximation theorem says that a sufficiently large neural network has the potential to approximate any function well, the actual reason why the standard method of training neural networks (stochastic gradient descent with backpropagation) performs so well in practice is still poorly understood and an active area of research.

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange