I think backpropagation is needed to find the direction for the gradient descent update.

I also wonder: is the scale (step size) also important?

I have heard about the vanishing (or exploding) gradient problem. But if backpropagation still gives us the correct direction, couldn't we still apply gradient descent (since we still know which way to update) and eventually reach the optimal answer?

If that is right, is there no real limit in deep learning? That is, even if training is slow, can we always finish training a neural network?


Solution

In backpropagation, we first feed the input forward through the model. Each node receives the input from the previous layer, computes the weighted sum, applies the activation function to that sum, and passes the result to the next layer as its input. This continues until we get the model's output.

Say we are classifying animal images: each output node corresponds (more or less) to a different category, and the node with the largest output is taken as the model's prediction.
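Here is a minimal sketch of that feed-forward step, assuming a tiny fully connected network with made-up random weights and a ReLU activation (all names and layer sizes are illustrative, not from the original question):

    import numpy as np

    def relu(x):
        # activation applied to the weighted sum of a hidden layer
        return np.maximum(0.0, x)

    def forward(x, layers):
        # each layer: weighted sum of its inputs, then activation, passed on as the next input
        *hidden, (W_out, b_out) = layers
        for W, b in hidden:
            x = relu(W @ x + b)
        return W_out @ x + b_out              # raw scores, one per output node

    # toy 3-class "animal" classifier: 4 input features -> 5 hidden units -> 3 output nodes
    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(5, 4)), np.zeros(5)),
              (rng.normal(size=(3, 5)), np.zeros(3))]

    scores = forward(rng.normal(size=4), layers)
    predicted_class = int(np.argmax(scores))  # the node with the largest output wins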

Then a loss is calculated on that result. Say we use SGD to optimize it: SGD needs the gradient of the loss with respect to the weights, "d(loss)/d(weights)", and this is where backpropagation comes into play.
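To make "d(loss)/d(weights)" concrete, here is a hand-worked gradient for a single linear neuron with a squared-error loss; this is a deliberately simplified stand-in for a real network. The starting weight 0.833 is the one used later in this answer, while the input and target are made up:

    # one neuron: prediction = w * x, loss = (prediction - target) ** 2
    x, target = 2.0, 1.0
    w = 0.833                                 # current weight

    prediction = w * x
    loss = (prediction - target) ** 2

    # chain rule: d(loss)/d(w) = d(loss)/d(prediction) * d(prediction)/d(w)
    grad = 2 * (prediction - target) * x      # this is what the update rule below consumes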

Since the output is computed from the weights and the input, changing the result means changing one of those two. We cannot change the input, so we change the weights. The weights are updated as

                "new weight=old weight - (gradient*learning_rate)"

Here is where the scale comes into play.

For simplicity, say a particular connection's weight is initialized to 0.833 and its optimal value is 0.437.

Also, the gradient with respect to a particular weight is the product of components that live later in the network (these components are themselves gradients with respect to later weights). So the earlier the weight sits in the network, the more components are multiplied together to compute its gradient.
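That product-of-components point can be checked numerically: if the gradient of an early weight is roughly a product of one factor per later layer, then many factors below 1 shrink it towards zero and many factors above 1 blow it up (the factor values and depth here are purely illustrative):

    import math

    def early_layer_gradient(per_layer_factor, depth):
        # gradient w.r.t. an early weight ~ product of one factor per later layer
        return math.prod(per_layer_factor for _ in range(depth))

    print(early_layer_gradient(0.5, 20))   # ~1e-6: the gradient vanishes
    print(early_layer_gradient(1.5, 20))   # ~3e3:  the gradient explodes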

1. Vanishing Gradient:

If we set our learning rate (the scale) to 0.0001, then

                              "gradient*learning_rate"

will be very small. It will take a long time to approach the optimal value, and at some point the weight will be effectively stuck, with almost no change from one update to the next.
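A quick loop, following this answer's framing of the learning rate as the scale and assuming a constant gradient of 1, shows how slowly the weight crawls from 0.833 towards 0.437:

    weight, optimal = 0.833, 0.437
    learning_rate, gradient = 0.0001, 1.0

    steps = 0
    while weight > optimal:
        weight -= gradient * learning_rate   # each update moves the weight by only 0.0001
        steps += 1
    print(steps)                             # roughly 3960 updates to cover a gap of 0.396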

2. Exploding Gradient:

If we set our learning rate (the scale) to 0.01, then

                              "gradient*learning_rate"

will be large enough to step past the optimal weight, so the optimal value gets skipped. For simplicity, say the gradient is 1:

                "new weight=old weight - (gradient*learning_rate)"

new weight=0.833-0.01=0.823

new weight=0.823-0.01=0.813

.

.

.

new weight=0.453-0.01=0.443

new weight=0.443-0.01=0.433

new weight=0.433-0.01=0.423

As you can see, the optimal value 0.437 is skipped over. That is why the scale is important and should be chosen carefully.
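The same run as a loop, again assuming a constant gradient of 1, shows the weight stepping from 0.443 straight to 0.433 and never landing on 0.437:

    weight, optimal = 0.833, 0.437
    learning_rate, gradient = 0.01, 1.0

    for _ in range(45):
        if abs(weight - optimal) < 1e-9:
            break                            # never triggers: 0.437 is stepped over
        weight -= gradient * learning_rate
        print(round(weight, 3))              # ..., 0.443, 0.433, 0.423, ...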
