Question

In PyTorch, we use:

nn.Conv2d(input_channel, output_channel, kernel_size)

in order to define the convolutional layers.

I understand that if the input is an image of size $\text{width} \times \text{height} \times 3$, we would set input_channel = 3. However, I am confused about what to do if I have a data set with dimensions $3 \times 3 \times 30$ or $30 \times 4 \times 5$.

Which number should I use to define the input_channel for these?

Thanks in advance.


Solution

The defining factor is which dimensions you want your 2-dimensional convolution to sweep over, e.g.:

  • In images, you want the 2D convolution to sweep over the height and width dimensions, and the extra dimension (the color space) is the channels; for grayscale images, you have a single channel.

  • In a spectrogram, you want the 2D convolution to sweep over the time and frequency dimensions. As there are no further dimensions, there is only one channel, like with grayscale images.
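As a minimal sketch of the two examples above, the snippet below builds one convolution for RGB images and one for single-channel spectrograms. The batch size, spatial sizes, number of output channels, and kernel size are arbitrary values chosen for illustration, and the spectrogram is assumed to be stored without a channel dimension, so one is added with unsqueeze.

```python
import torch
import torch.nn as nn

# RGB images: batch of 8, 3 color channels, 32 x 32 pixels (illustrative sizes).
rgb = torch.randn(8, 3, 32, 32)
conv_rgb = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
rgb_out = conv_rgb(rgb)
print(rgb_out.shape)  # torch.Size([8, 16, 30, 30])

# Spectrograms: batch of 8, each 128 frequency bins x 100 time frames,
# assumed here to be stored without an explicit channel dimension.
spec = torch.randn(8, 128, 100)
spec = spec.unsqueeze(1)  # add a channel dimension of size 1 -> (8, 1, 128, 100)
conv_spec = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)
spec_out = conv_spec(spec)
print(spec_out.shape)  # torch.Size([8, 16, 126, 98])
```

In both cases the convolution sweeps over the last two dimensions, and in_channels matches the size of the dimension that is left over.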

In the cases you propose, e.g. $3 \times 3 \times 30$, if we want the 2D convolution to sweep over the first two dimensions, then the number of input channels would be 30. If we wanted the 2D convolution to sweep over two other dimensions, then the remaining one would be the number of input channels. The same applies to $30 \times 4 \times 5$.
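To make the choice concrete, here is a hedged sketch of three possible readings of the shapes from the question. The tensors are created directly with the channel dimension in second position and a batch size of 1; out_channels=8 and kernel_size=3 are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

# "3 x 3 x 30", sweeping over the 3 x 3 plane: 30 input channels.
x_a = torch.randn(1, 30, 3, 3)  # (batch, channels, height, width)
conv_a = nn.Conv2d(in_channels=30, out_channels=8, kernel_size=3)
out_a = conv_a(x_a)
print(out_a.shape)  # torch.Size([1, 8, 1, 1])

# "30 x 4 x 5", sweeping over the 4 x 5 plane: 30 input channels.
x_b = torch.randn(1, 30, 4, 5)
conv_b = nn.Conv2d(in_channels=30, out_channels=8, kernel_size=3)
out_b = conv_b(x_b)
print(out_b.shape)  # torch.Size([1, 8, 2, 3])

# "30 x 4 x 5" again, but sweeping over the 30 x 4 plane: 5 input channels.
x_c = torch.randn(1, 5, 30, 4)
conv_c = nn.Conv2d(in_channels=5, out_channels=8, kernel_size=3)
out_c = conv_c(x_c)
print(out_c.shape)  # torch.Size([1, 8, 28, 2])
```

Which reading is appropriate depends on the data: the two swept dimensions should be ones where neighboring positions are meaningfully related, as with pixels in an image.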

We should note, however, that PyTorch's 2D convolutions follow a strict convention in the ordering of dimensions. As described in the PyTorch documentation, the expected input shape is $(N,C_{in},H,W)$, i.e. batch size, input channels, height and width. This means that we should rearrange the dimensions of our input tensor (e.g. with torch.Tensor.permute) so that the two dimensions over which we want the 2D convolution to sweep come last and the channel dimension comes right before them.
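A short sketch of that rearrangement, assuming a single sample stored in channels-last order with shape (3, 3, 30); the variable names, out_channels=8, and kernel_size=2 are illustrative choices, not anything prescribed by the question.

```python
import torch
import torch.nn as nn

# One sample stored as (height, width, channels) = (3, 3, 30).
sample = torch.randn(3, 3, 30)

# Move the channel dimension to the front: (channels, height, width) = (30, 3, 3).
x = sample.permute(2, 0, 1)

# Add a batch dimension of size 1 to match (N, C_in, H, W).
x = x.unsqueeze(0)
print(x.shape)  # torch.Size([1, 30, 3, 3])

conv = nn.Conv2d(in_channels=30, out_channels=8, kernel_size=2)
out = conv(x)
print(out.shape)  # torch.Size([1, 8, 2, 2])
```

For a whole batch stored as (N, H, W, C), the equivalent call would be permute(0, 3, 1, 2).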

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange