Leaky ReLU inside of a Simple Python Neural Net
-
09-12-2020 - |
Pergunta
To build a simple 1-layer neural network, many tutorials use a sigmoid function as the activation function. According to scholarly articles and other online sources, a leaky ReLU is a better alternative; however, I cannot find a way to alter my code snippet to allow a leaky ReLU.
I tried logic like if x > 0 then x else x/100
as the activation function, then the same for the derivative.
Is it failing because the output layer cannot have a ReLU? Should I change the first layer to ReLU then add a softmax output layer?
import numpy as np
np.random.seed(1)
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([[0,1,0,1]]).T
class NN:
def __init__(self, X, y):
self.X = X
self.y = y
self.W = np.random.uniform(-1, 1, (X.shape[1], 1))
self.b = np.random.uniform(-1, 1, (X.shape[1], 1))
def nonlin(self, x, deriv=False):
if deriv:
return x*(1-x)
return 1/(1+np.exp(-x))
def forward(self):
self.l1 = self.nonlin(np.dot(self.X, self.W + self.b))
self.errors = self.y - self.l1
print(abs(sum(self.errors)[0]))
def backward(self):
self.l1_delta = self.errors * nonlin(self.l1, True)
self.W += np.dot(self.X.T, self.l1_delta)
self.b += np.dot(self.X.T, self.l1_delta)
def train(self, epochs=20):
for _ in range(epochs):
self.forward()
self.backward()
nn = NN(X, y)
nn.train()
Solução
I assume the task you're working on is a binary classification task since y = np.array([[0,1,0,1]]).T
. And as an error function you use self.errors = self.y - self.l1
. Now compare this to the curve of a leaky ReLU function:
The leaky ReLU is an unbounded function. How is your network supposed to model a binary classification task where output values are elements of $\{0,1\}$ using this function? And what is the result of applying the absolute difference as an error function to your labels $y \in \{0,1\}$ and your outputs $\hat{y} \in (-\infty,\infty)$?
What you would need is a translation of the lReLU outputs to your classes, e.g. something like prediction = 1.0 if activation >= 0.0 else 0.0
.
Which is one reason why it is more common to apply last layer activation functions to classification problems which can be interpreted as class probabilities, e.g. Softmax or Sigmoid:
You still need a threshold here to move from real valued outputs (which are interpreted as probabilities) to binary class labels but that is fairly straightforward since often you just set it to a probability of $0.5$.
I suggest to go follow a tutorial for your from scratch implementation. If you are willing to take a step back here is one for a simple percepton (i.e. just a single neuron).
Images taken from this article