Question

To build a simple 1-layer neural network, many tutorials use a sigmoid function as the activation function. According to scholarly articles and other online sources, a leaky ReLU is a better alternative; however, I cannot find a way to alter my code snippet to allow a leaky ReLU.

I tried logic like if x > 0 then x else x/100 as the activation function, then the same for the derivative.
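In the snippet below, the attempt amounts to swapping nonlin for something like this (a rough sketch of what I mean, with a slope of 1/100 for negative values):

  def nonlin(self, x, deriv=False):
    # leaky ReLU: x for x > 0, x/100 otherwise
    if deriv:
      # slope of the leaky ReLU; backward calls this on the layer output,
      # which is fine here because the output has the same sign as the input
      return np.where(x > 0, 1.0, 1.0 / 100)
    return np.where(x > 0, x, x / 100)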

Is it failing because the output layer cannot have a ReLU? Should I change the first layer to ReLU then add a softmax output layer?

import numpy as np
np.random.seed(1)

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([[0,1,0,1]]).T

class NN:
  def __init__(self, X, y):
    self.X = X
    self.y = y
    self.W = np.random.uniform(-1, 1, (X.shape[1], 1))  # one weight per input feature
    self.b = np.random.uniform(-1, 1, (1, 1))            # one bias for the single output neuron

  def nonlin(self, x, deriv=False):
    # sigmoid activation; for deriv=True, x is expected to be the sigmoid output
    if deriv:
      return x*(1-x)
    return 1/(1+np.exp(-x))

  def forward(self):
    self.l1 = self.nonlin(np.dot(self.X, self.W) + self.b)  # bias added after the dot product
    self.errors = self.y - self.l1
    print(np.mean(np.abs(self.errors)))  # mean absolute error (signed errors would cancel out)

  def backward(self):
    self.l1_delta = self.errors * self.nonlin(self.l1, True)
    self.W += np.dot(self.X.T, self.l1_delta)
    self.b += np.sum(self.l1_delta, axis=0, keepdims=True)  # bias gradient: sum over the batch

  def train(self, epochs=20):
    for _ in range(epochs):
      self.forward()
      self.backward()

nn = NN(X, y)
nn.train()

Solution

I assume the task you're working on is binary classification, since y = np.array([[0,1,0,1]]).T, and that you use self.errors = self.y - self.l1 as your error signal. Now compare this to the curve of a leaky ReLU function:

[Image: leaky ReLU curve]
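For reference, the leaky ReLU is $f(x) = x$ for $x > 0$ and $f(x) = \alpha x$ otherwise, where $\alpha$ is a small slope such as $0.01$ (the $x/100$ from your question), so its range is the whole real line.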

The leaky ReLU is an unbounded function. How is your network supposed to model a binary classification task where output values are elements of $\{0,1\}$ using this function? And what is the result of applying the absolute difference as an error function to your labels $y \in \{0,1\}$ and your outputs $\hat{y} \in (-\infty,\infty)$?

What you would need is a translation of the leaky ReLU outputs to your classes, e.g. something like prediction = 1.0 if activation >= 0.0 else 0.0.
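Vectorized over a whole batch of raw outputs, that mapping is a single line (a sketch; activation here stands for whatever your output layer returns):

predictions = np.where(activation >= 0.0, 1.0, 0.0)  # map raw leaky ReLU outputs to class labels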

This is one reason why, for classification problems, it is more common to use a last-layer activation function whose outputs can be interpreted as class probabilities, e.g. Softmax or Sigmoid:

[Image: sigmoid curve]

You still need a threshold here to move from real-valued outputs (which are interpreted as probabilities) to binary class labels, but that is fairly straightforward: you usually just set it to a probability of $0.5$.
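To make this concrete, here is a rough sketch (not your code; the hidden-layer size, learning rate and number of epochs are arbitrary choices of mine) of a leaky ReLU hidden layer feeding a sigmoid output, thresholded at $0.5$:

import numpy as np
np.random.seed(1)

def leaky_relu(x, deriv=False):
  if deriv:
    return np.where(x > 0, 1.0, 0.01)
  return np.where(x > 0, x, 0.01 * x)

def sigmoid(x, deriv=False):
  s = 1 / (1 + np.exp(-x))
  return s * (1 - s) if deriv else s

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([[0,1,0,1]]).T

W1 = np.random.uniform(-1, 1, (3, 4)); b1 = np.zeros((1, 4))  # hidden layer (4 units)
W2 = np.random.uniform(-1, 1, (4, 1)); b2 = np.zeros((1, 1))  # output layer (1 unit)
lr = 0.5

for _ in range(2000):
  # forward: leaky ReLU in the hidden layer, sigmoid on the output
  z1 = np.dot(X, W1) + b1
  h = leaky_relu(z1)
  z2 = np.dot(h, W2) + b2
  out = sigmoid(z2)                      # in (0, 1), interpretable as a probability

  # backward: same squared-error-style update as in your code
  out_delta = (y - out) * sigmoid(z2, deriv=True)
  h_delta = np.dot(out_delta, W2.T) * leaky_relu(z1, deriv=True)
  W2 += lr * np.dot(h.T, out_delta); b2 += lr * out_delta.sum(axis=0, keepdims=True)
  W1 += lr * np.dot(X.T, h_delta);   b1 += lr * h_delta.sum(axis=0, keepdims=True)

predictions = (out >= 0.5).astype(float)  # threshold the probabilities at 0.5

The leaky ReLU lives only in the hidden layer; the sigmoid at the end is what turns the unbounded hidden representation into something you can threshold.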

I suggest following a tutorial for your from-scratch implementation. If you are willing to take a step back, here is one for a simple perceptron (i.e. just a single neuron).


Images taken from this article

Licensed under: CC-BY-SA with attribution