Question

This question boils down to "how exactly do convolution layers work?"

Suppose I have an $n \times m$ greyscale image. So the image has one channel. In the first layer, I apply a $3\times 3$ convolution with $k_1$ filters and padding. Then I have another convolution layer with $5 \times 5$ convolutions and $k_2$ filters. How many feature maps do I have?

Type 1 convolution

The first layer gets executed. After that, I have $k_1$ feature maps (one for each filter). Each of those has the size $n \times m$. Every single pixel was created by taking $3 \cdot 3 = 9$ pixels from the padded input image.

Then the second layer gets applied. Every single filter gets applied separately to each of the feature maps. This results in $k_2$ feature maps for each of the $k_1$ feature maps. So there are $k_1 \times k_2$ feature maps after the second layer. Every single pixel of each of the new feature maps was created by taking $5 \cdot 5 = 25$ "pixels" of the padded feature map from before.

The system has to learn $k_1 \cdot 3 \cdot 3 + k_2 \cdot 5 \cdot 5$ parameters.

Type 2.1 convolution

Like before: The first layer gets executed. After that, I have $k_1$ feature maps (one for each filter). Each of those has the size $n \times m$. Every single pixel was created by taking $3 \cdot 3 = 9$ pixels from the padded input image.

Unlike before: Then the second layer gets applied. Every single filter gets applied to the same region, but across all feature maps from before. This results in $k_2$ feature maps in total after the second layer has been executed. Every single pixel of each of the new feature maps was created by taking $k_1 \cdot 5 \cdot 5 = 25 \cdot k_1$ "pixels" of the padded feature maps from before.

The system has to learn $k_1 \cdot 3 \cdot 3 + k_2 \cdot 5 \cdot 5$ parameters.

Type 2.2 convolution

Like above, but instead of having $5 \cdot 5 = 25$ parameters per filter which have to be learned and are simply copied for the other input feature maps, you have $k_1 \cdot 3 \cdot 3 + k_2 \cdot k_1 \cdot 5 \cdot 5$ parameters which have to be learned.
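
To make the bookkeeping of the three variants concrete, here is a small Python sketch with assumed example values $k_1 = 32$ and $k_2 = 64$ (the numbers are arbitrary, chosen only for illustration):

```python
# Assumed example values, chosen only to make the three counts concrete
k1, k2 = 32, 64

type_1   = k1 * 3 * 3 + k2 * 5 * 5          # second-layer filters reused on every input map
type_2_1 = k1 * 3 * 3 + k2 * 5 * 5          # same count: weights copied across the k1 input maps
type_2_2 = k1 * 3 * 3 + k2 * k1 * 5 * 5     # separate 5x5 weights for every input map

print(type_1, type_2_1, type_2_2)           # 1888 1888 51488
```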

Question

  1. Is type 1 or type 2 typically used?
  2. Which type is used in AlexNet?
  3. Which type is used in GoogLeNet?
    • If you say type 1: Why do $1 \times 1$ convolutions make any sense? Don't they only multiply the data by a constant?
    • If you say type 2: Please explain the quadratic cost ("For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation")

For all answers, please give some evidence (papers, textbooks, documentation of frameworks) that your answer is correct.

Bonus question 1

Is the pooling applied always only per feature map or is it also done over multiple feature maps?

Bonus question 2

I'm relatively sure that type 1 is correct and that I got something wrong with the GoogLeNet paper. But there are 3D convolutions, too. Let's say you have 1337 feature maps of size $42 \times 314$ and you apply a $3 \times 4 \times 5$ filter. How do you slide the filter over the feature maps? (Left to right, top to bottom, first feature map to last feature map?) Does it matter as long as you do it consistently?


Solution

I am not sure about the alternatives described above, but the commonly used methodology is:

Before the application of the non-linearity, each filter output depends linearly on all of the previous layer's feature maps within the patch, so you end up with $k_2$ feature maps after the second layer. The overall number of parameters is $3 \cdot 3 \cdot k_1 + k_1 \cdot 5 \cdot 5 \cdot k_2$.
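
A minimal sketch of this, assuming PyTorch and arbitrary illustrative values ($n = m = 28$, $k_1 = 32$, $k_2 = 64$):

```python
import torch
import torch.nn as nn

n, m, k1, k2 = 28, 28, 32, 64          # arbitrary illustrative sizes

conv1 = nn.Conv2d(in_channels=1,  out_channels=k1, kernel_size=3, padding=1)
conv2 = nn.Conv2d(in_channels=k1, out_channels=k2, kernel_size=5, padding=2)

x = torch.randn(1, 1, n, m)            # one single-channel n x m image
y = conv2(conv1(x))

print(y.shape)                         # torch.Size([1, 64, 28, 28]): k2 feature maps
print(conv1.weight.shape)              # torch.Size([32, 1, 3, 3]):   3*3*k1 weights
print(conv2.weight.shape)              # torch.Size([64, 32, 5, 5]):  k1*5*5*k2 weights
print(conv1.weight.numel() + conv2.weight.numel())   # 3*3*k1 + k1*5*5*k2 = 51488
```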

Bonus 1: Pooling is done per feature map, separately.
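
For instance, a max-pooling layer (a PyTorch sketch with an assumed channel count) shrinks the spatial dimensions but leaves the number of feature maps unchanged:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)         # e.g. k2 = 64 feature maps of size 28 x 28
pool = nn.MaxPool2d(kernel_size=2)     # pools each feature map independently
print(pool(x).shape)                   # torch.Size([1, 64, 14, 14]): still 64 maps
```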

Bonus 2: The order of "sliding" does not matter. In fact, each output is computed based on the previous layer, so the output filter responses do not depend on each other. They can be computed in parallel.

OTHER TIPS

I have just struggled with this same question for a few hours. Thought I'd share the insight that helped me understand it.

The answer is that the filters for the second convolutional layer do not have the same dimensionality as the filters for the first layer. In general, the filter has to have the same number of dimensions as its inputs. So in the first conv layer, the input has 2 dimensions (because it is an image). Thus the filters also have two dimensions. If there are 20 filters in the first conv layer, then the output of the first conv layer is a stack of 20 2D feature maps. So the output of the first conv layer is 3 dimensional, where the size of the third dimension is equal to the number of filters in the first layer.

Now this 3D stack forms the input to the second conv layer. Since the input to the 2nd layer is 3D, the filters also have to be 3D. The size of the second layer's filters in the third dimension is equal to the number of feature maps output by the first layer.

Now you just convolve over the first two dimensions: rows and columns. Thus the convolution of each 2nd-layer filter with the stack of feature maps (the output of the first layer) yields a single feature map.

The size of the third dimension of the output of the second layer is therefore equal to the number of filters in the second layer.
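
A rough NumPy sketch of this idea (the sizes are assumed, and the loop computes the cross-correlation that CNN frameworks actually use): each second-layer filter has depth equal to the number of first-layer maps and slides only over rows and columns, so the whole stack collapses into one output map.

```python
import numpy as np

k1, H, W = 20, 32, 32                    # 20 first-layer feature maps of size 32 x 32
stack  = np.random.randn(k1, H, W)       # the 3D input to the second layer
kernel = np.random.randn(k1, 5, 5)       # one second-layer filter: depth equals k1

out = np.zeros((H - 4, W - 4))           # 'valid' convolution, no padding
for i in range(H - 4):
    for j in range(W - 4):
        # each output pixel sums over all k1 maps and the 5 x 5 window
        out[i, j] = np.sum(stack[:, i:i + 5, j:j + 5] * kernel)

print(out.shape)                         # (28, 28): one single 2D feature map
```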

Check this lecture and this visualization

Usually, type 2.1 convolution is used. The input is an $N \times M \times 1$ image; after the first convolution you obtain an $N_1 \times M_1 \times k_1$ array, so your image after the first convolution has $k_1$ channels. The new dimensions $N_1$ and $M_1$ depend on your stride $S$ and padding $P$: $N_1 = (N - 3 + 2P)/S + 1$, and $M_1$ is computed analogously. The first conv layer has $3 \cdot 3 \cdot k_1 + k_1$ weights; the extra $k_1$ accounts for the biases added before the nonlinearity.

In the second layer, the input is an array of size $N_1 \times M_1 \times k_1$, where $k_1$ is the new number of channels. After the second convolution you obtain an $N_2 \times M_2 \times k_2$ array, and the second layer has $5 \cdot 5 \cdot k_1 \cdot k_2 + k_2$ parameters.

For a $1 \times 1$ convolution with $k_3$ filters and an $N \times M \times C$ input ($C$ is the number of input channels), you obtain a new $N \times M \times k_3$ array, so $1 \times 1$ convolutions do make sense. They were introduced in this paper.
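
Plugging some assumed numbers into these formulas (a sketch only; the values $N = M = 224$, $P = 1$, $S = 1$, $k_1 = 32$, $k_2 = 64$, $k_3 = 16$ are chosen purely for illustration):

```python
# Assumed illustrative values, not taken from the answer above
N, M, P, S = 224, 224, 1, 1
k1, k2, k3 = 32, 64, 16

N1 = (N - 3 + 2 * P) // S + 1            # 224: the 3x3 layer keeps the spatial size
M1 = (M - 3 + 2 * P) // S + 1            # 224

params_layer1 = 3 * 3 * 1  * k1 + k1     # 3*3*k1 weights + k1 biases    = 320
params_layer2 = 5 * 5 * k1 * k2 + k2     # 5*5*k1*k2 weights + k2 biases = 51264
params_1x1    = 1 * 1 * k2 * k3 + k3     # a 1x1 layer still mixes all k2 input channels

print(N1, M1, params_layer1, params_layer2, params_1x1)
```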

Bonus 1: pooling is applied per feature map.

For details, please see the slides for the Stanford CNN course; they have a nice visualisation of how the convolution is summed over several input channels.

The first layer consists of $k_1$ kernels with size $3 \cdot 3 \cdot 1$ to give $k_1$ feature maps which are stacked depth-wise.

The second layer consists of $k_2$ kernels with size $5 \cdot 5 \cdot k_1$ to give $k_2$ feature maps which are stacked depth-wise.

That is, the kernels in a convolutional layer span the depth of the output of the previous layer.

A $1 \times 1$ convolutional layer actually has $k_n$ kernels of size $1 \cdot 1 \cdot k_{n-1}$.
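
A short sketch of this, assuming PyTorch (the channel counts are arbitrary): a $1 \times 1$ convolutional layer is just a per-pixel linear map across the $k_{n-1}$ input channels.

```python
import torch
import torch.nn as nn

k_prev, k_n = 64, 16                       # assumed channel counts
conv1x1 = nn.Conv2d(k_prev, k_n, kernel_size=1, bias=False)
print(conv1x1.weight.shape)                # torch.Size([16, 64, 1, 1]): k_n kernels of size 1 x 1 x k_prev

# The same operation written as a per-pixel matrix multiplication across channels:
x  = torch.randn(1, k_prev, 8, 8)
y1 = conv1x1(x)
y2 = torch.einsum('oi,bihw->bohw', conv1x1.weight[:, :, 0, 0], x)
print(torch.allclose(y1, y2, atol=1e-6))   # True
```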

Speculation:

Bonus question 2 is not something I'm familiar with, but I will guess the depth parameter in the convolution becomes an extra dimension.

e.g., if the output of a layer is of size $m \cdot n \cdot k_{n}$, a 3D convolution with padding would result in an output of size $m \cdot n \cdot k_{n+1} \cdot k_{n}$.
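
One way to probe that guess, assuming PyTorch and a small made-up stack instead of the 1337 maps from the question: a true 3D convolution slides the kernel along rows, columns, and the feature-map axis simultaneously, and the output gains an extra axis for the number of 3D filters.

```python
import torch
import torch.nn as nn

depth, m, n = 8, 42, 314                 # made-up stack: 8 feature maps of size 42 x 314
x = torch.randn(1, 1, depth, m, n)       # treat the stack as a single-channel 3D volume
conv3d = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=(3, 4, 5))
print(conv3d(x).shape)                   # torch.Size([1, 4, 6, 39, 310]): slides along all three axes
```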

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange