Question

If we have an MLP, we can easily compute the gradient for each parameter by applying backpropagation recursively, starting from the last layer of the network. But suppose I have a neural network that consists of different types of layers, for instance Input -> convolution layer -> ReLU -> max pooling -> fully connected layer -> softmax layer. How do I compute the gradient for each parameter?


Solution

The different layers you describe can all have their gradients calculated using the same back propagation equations as for a simpler MLP. It is still the same recursive process, but it is adapted to the parameters and structure of each layer in turn.

There are some details worth noting:

  • If you want to understand the correct formulas to use, you will need to study the equations of back propagation using the chain rule (note I have picked one worked example; there are plenty to choose from, including some notes I made myself for a now defunct software project).

  • When feed-forward values overlap (e.g. convolution) or are selected (e.g. dropout, max pooling), the combinations are usually logically simple and easy to understand:

    • For overlapped and combined weights, such as with convolution, the gradients simply add. When you back propagate the gradients from each feature "pixel" in a higher layer, they add into the gradients for the shared weights in the kernel, and also add into the gradients for the feature map "pixels" in the layer below (in each case you might start by creating an all-zero matrix and summing the gradients into it, as in the convolution sketch after this list).

    • For a selection mechanism, such as a max pooling layer, you only back propagate the gradient to the neuron in the previous layer whose output was selected. The others do not affect the output, so by definition increasing or decreasing their value has no effect; they have a gradient of 0 for the example being calculated (see the pooling sketch after this list).

  • In a feed-forward network, each layer's processing is independent of the next, so you only have a complex rule to follow if you have a complex layer. You can write the back propagation equations so that they relate the gradients in one layer to the already-calculated gradients in the layer above (and ultimately to the loss function evaluated at the output layer). Once you have back-propagated the gradient from the output layer, it no longer matters which activation function that layer used: at that point the only difference is numeric, and the equations relating the gradients of deeper layers to each other do not depend on the output layer at all (the final sketch below shows this hand-off from a softmax output into a fully connected layer).

  • Finally, if you just want to use a neural network library, you don't need to worry much about this; it is usually done for you. All the standard activation functions and layer architectures are covered by existing code. It is only when creating your own implementation from scratch, or when using unusual functions or structures, that you might need to go as far as deriving the gradients directly.
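To make the convolution bullet concrete, here is a minimal NumPy sketch, not code from any particular library: the single-channel "valid" cross-correlation, the shapes and the upstream gradient d_out are all illustrative assumptions. The point it shows is that every gradient contribution for the shared kernel weights and for the input feature map is simply summed into an all-zero buffer.

```python
# A minimal sketch (illustrative assumptions throughout) of how gradients for a
# shared convolution kernel and its input accumulate by simple addition.
import numpy as np

def conv2d_forward(x, kernel):
    """Single-channel 'valid' cross-correlation, as used in most CNN layers."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def conv2d_backward(x, kernel, d_out):
    """Back propagate d_out (dL/d_out) through the convolution.

    Every output 'pixel' that touched a given kernel weight or input pixel
    contributes an additive term, so we start from all-zero gradient buffers
    and sum into them.
    """
    kh, kw = kernel.shape
    d_kernel = np.zeros_like(kernel)   # gradient w.r.t. the shared weights
    d_x = np.zeros_like(x)             # gradient w.r.t. the layer below
    for i in range(d_out.shape[0]):
        for j in range(d_out.shape[1]):
            d_kernel += d_out[i, j] * x[i:i + kh, j:j + kw]
            d_x[i:i + kh, j:j + kw] += d_out[i, j] * kernel
    return d_kernel, d_x

# Example: a 5x5 input, a 3x3 kernel, and a made-up upstream gradient.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5))
kernel = rng.standard_normal((3, 3))
d_out = np.ones((3, 3))                # pretend dL/d_out is all ones
d_kernel, d_x = conv2d_backward(x, kernel, d_out)
print(d_kernel.shape, d_x.shape)       # (3, 3) (5, 5)
```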
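The selection rule for max pooling looks like this. Again a sketch under assumed shapes (non-overlapping 2x2 windows): only the input position that produced the maximum receives the upstream gradient, and every other position in the window gets exactly zero.

```python
# A minimal sketch (illustrative assumptions, not library code) of 2x2 max
# pooling: the forward pass remembers which input "won" each window, and the
# backward pass routes the upstream gradient only to that winner.
import numpy as np

def maxpool2x2_forward(x):
    """Non-overlapping 2x2 max pooling; assumes even height and width."""
    h, w = x.shape
    out = np.zeros((h // 2, w // 2))
    argmax = np.zeros((h // 2, w // 2, 2), dtype=int)  # winner positions
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            a, b = np.unravel_index(np.argmax(window), (2, 2))
            out[i // 2, j // 2] = window[a, b]
            argmax[i // 2, j // 2] = (i + a, j + b)
    return out, argmax

def maxpool2x2_backward(d_out, argmax, input_shape):
    """Route each upstream gradient to its stored argmax position only."""
    d_x = np.zeros(input_shape)
    for i in range(d_out.shape[0]):
        for j in range(d_out.shape[1]):
            a, b = argmax[i, j]
            d_x[a, b] += d_out[i, j]
    return d_x

# Example: pool a 4x4 input down to 2x2, then send the gradient back.
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 4))
out, argmax = maxpool2x2_forward(x)
d_x = maxpool2x2_backward(np.ones_like(out), argmax, x.shape)
print(d_x)   # exactly one non-zero entry per 2x2 window
```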
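Lastly, a sketch of the output end of the chain, assuming (as in the question) a softmax output trained with a cross-entropy loss and a single fully connected layer feeding it; the sizes and values are made up. The combined gradient at the logits is simply the predicted probabilities minus the one-hot target, and from there the usual matrix-calculus rules hand gradients back to the layer's weights and to the layer below, which is all the next stage of back propagation needs.

```python
# A minimal sketch of the softmax + cross-entropy output gradient chained into
# a fully connected layer, for a single example. Shapes and values are
# illustrative assumptions.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / np.sum(e)

# Forward pass through the final fully connected layer.
rng = np.random.default_rng(2)
h = rng.standard_normal(8)             # activations from the layer below
W = rng.standard_normal((3, 8))        # fully connected layer weights
b = np.zeros(3)
logits = W @ h + b
probs = softmax(logits)
target = np.array([0.0, 1.0, 0.0])     # one-hot label

# Backward pass: gradient of cross-entropy w.r.t. the logits ...
d_logits = probs - target
# ... chained into the layer's parameters and into the layer below. Deeper
# layers only ever see d_h; they never look at the softmax again.
d_W = np.outer(d_logits, h)
d_b = d_logits
d_h = W.T @ d_logits
print(d_W.shape, d_b.shape, d_h.shape)  # (3, 8) (3,) (8,)
```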

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange