Question

When the learning rate is 0.01, the loss seems to be decreasing, whereas when I increase the learning rate even slightly, the loss increases. Why does this happen? Are the gradients calculated incorrectly?

Neural network with two hidden layers: 128 neurons in the first hidden layer and 64 in the second. The output layer consists of a single sigmoid neuron.

import numpy as np
import matplotlib.pyplot as plt

class FNN:
    def __init__(self):
        # weights and biases are created in fit(), once the input dimension is known
        self.W1=None
        self.b1=None
        self.W2=None
        self.b2=None
        self.W3=None
        self.b3=None
    def sigmoid(self,x):
        return 1/(1+np.exp(-x))
    def forward_prop(self,x):
        # two tanh hidden layers followed by a single sigmoid output unit
        self.Z1=np.dot(self.W1,x)+self.b1
        self.A1=np.tanh(self.Z1)
        self.Z2=np.dot(self.W2,self.A1)+self.b2
        self.A2=np.tanh(self.Z2)
        self.Z3=np.dot(self.W3,self.A2)+self.b3
        self.A3=self.sigmoid(self.Z3)
        return self.A3
    def back_prop(self,x,y):
        # gradients of the binary cross-entropy loss; tanh'(z) = 1 - tanh(z)**2
        self.forward_prop(x)
        m=x.shape[1]
        self.dZ3=self.A3-y
        self.dW3=np.dot(self.dZ3,self.A2.T)/m
        self.db3=np.sum(self.dZ3,axis=1,keepdims=True)/m
        self.dZ2=np.dot(self.W3.T,self.dZ3)*(1-self.A2**2)
        self.dW2=np.dot(self.dZ2,self.A1.T)/m
        self.db2=np.sum(self.dZ2,axis=1,keepdims=True)/m
        self.dZ1=np.dot(self.W2.T,self.dZ2)*(1-self.A1**2)
        self.dW1=np.dot(self.dZ1,x.T)/m
        self.db1=np.sum(self.dZ1,axis=1,keepdims=True)/m  # axis=1 was missing here, which summed over the whole matrix instead of per neuron
    def fit(self,x,y,epochs=100,learning_rate=0.01,plot=True,disp_loss=False):
        np.random.seed(4)
        # note: W1 is drawn from a uniform [0,1) distribution (rand), while W2 and W3 use a standard normal (randn)
        self.W1=np.random.rand(128,x.shape[0])
        self.b1=np.zeros((128,1))
        self.W2=np.random.randn(64,128)
        self.b2=np.zeros((64,1))
        self.W3=np.random.randn(1,64)
        self.b3=np.zeros((1,1))
        m=x.shape[1]
        loss=[]
        for i in range(epochs):
            self.back_prop(x,y)
            self.W1-=learning_rate*self.dW1
            self.b1-=learning_rate*self.db1
            self.W2-=learning_rate*self.dW2
            self.b2-=learning_rate*self.db2
            self.W3-=learning_rate*self.dW3
            self.b3-=learning_rate*self.db3
            logprobs=y*np.log(self.A3)+(1-y)*np.log(1-self.A3)  # binary cross-entropy
            cost=-(np.sum(logprobs))/m
            loss.append(cost)
        e=np.arange(1,epochs+1)
        if plot:
            plt.plot(e,loss)
            plt.title('LOSS PLOT')
            plt.xlabel('Epoch')
            plt.ylabel('Loss')
            plt.show()
        if disp_loss:
            print(loss)
    def predict(self,x):
        y=np.where(self.forward_prop(x)>=0.5,1,0)
        return y

F=FNN()
F.fit(x_train,y_train)
y_pred=F.predict(x_train)

Output

[Loss plot for learning rate 0.01]

[Loss plot for learning rate 1]


Solution

Based on your plots, it doesn't seem to be a problem in your case (see my comment). The spike you see when you increase the learning rate is very likely due to the following.

Gradient descent can be simplified using the image below.

[Illustration: gradient descent on a bowl-shaped loss curve]

Your goal is to reach the bottom of the bowl (the optimum). The gradient tells you in which direction to go (in this simple one-dimensional picture, left or right), and you can view the learning rate as the size of each step you take. If the learning rate is too small, training is merely slow. If it is too high, you move in the right direction but overshoot the minimum and end up higher in the bowl than where you started; keep overshooting like that and the loss grows instead of shrinking. That's called diverging.
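To make the overshooting concrete, here is a minimal toy sketch (my own example, not taken from your network): plain gradient descent on the one-dimensional loss f(w) = w**2, whose gradient is 2*w. With a small learning rate the iterates shrink toward the minimum at w = 0; with a learning rate above 1 every update overshoots and |w| grows, i.e. the loss diverges.

def descend(learning_rate, w=5.0, steps=10):
    # gradient descent on f(w) = w**2, whose gradient is df/dw = 2*w
    trajectory = [w]
    for _ in range(steps):
        w = w - learning_rate * 2 * w
        trajectory.append(round(w, 3))
    return trajectory

print(descend(0.1))  # converges: 5.0, 4.0, 3.2, 2.56, ... -> 0
print(descend(1.1))  # diverges:  5.0, -6.0, 7.2, -8.64, ... blows up

The step is proportional to both the gradient and the learning rate, so a learning rate that is too large keeps jumping across the bowl to ever higher points.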

It's also worth noting that it can be completely normal for the loss not to decrease at every epoch. This is particularly true with mini-batch gradient descent: in that setting the gradient is computed on a subset of the data and is only a noisy estimate of the true gradient, so every once in a while an update simply steps in the wrong direction.
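As a rough illustration of that noise (again a toy sketch, not your code), the gradient estimated from a small random mini-batch scatters around the full-batch gradient, so an individual update can occasionally point the wrong way even though the average direction is correct:

import numpy as np

np.random.seed(0)
x = np.random.randn(1000)
y = 3.0 * x + 0.5 * np.random.randn(1000)      # data from y = 3x + noise
w = 0.0                                        # current parameter of a 1-D linear model

# exact gradient of the mean squared error over the full dataset
full_grad = np.mean(2 * (w * x - y) * x)
print('full-batch gradient:', full_grad)

# gradients estimated from random mini-batches of 16 samples
for _ in range(5):
    idx = np.random.choice(len(x), size=16, replace=False)
    mini_grad = np.mean(2 * (w * x[idx] - y[idx]) * x[idx])
    print('mini-batch estimate:', mini_grad)

Each mini-batch estimate differs a bit from the full-batch value, which is why a loss curve from mini-batch training wiggles even when the learning rate is fine.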

I hope this explanation helps!
