Question

I recently came across a paper on semantic segmentation using a deconvolution network: Learning Deconvolution Network for Semantic Segmentation (http://arxiv.org/abs/1505.04366).

The basic structure of the network is like this: [figure from the paper: a convolution network mirrored by a deconvolution network, with pooling/unpooling and convolution/deconvolution layers]

The goal is to generate a probability map in the end. I'm having trouble figuring out how to realize the deconvolution layer. In the paper, it says:

The output of an unpooling layer is an enlarged, yet sparse activation map. The deconvolution layers densify the sparse activations obtained by unpooling through convolution-like operations with multiple learned filters. However, contrary to convolutional layers, which connect multiple input activations within a filter window to a single activation, deconvolutional layers associate a single input activation with multiple outputs.

The output of the deconvolutional layer is an enlarged and dense activation map. We crop the boundary of the enlarged activation map to keep the size of the output map identical to the one from the preceding unpooling layer.

The learned filters in deconvolutional layers correspond to bases to reconstruct the shape of an input object. Therefore, similar to the convolution network, a hierarchical structure of deconvolutional layers is used to capture different levels of shape detail. The filters in lower layers tend to capture the overall shape of an object, while class-specific fine details are encoded in the filters of higher layers. In this way, the network directly takes class-specific shape information into account for semantic segmentation.

Can anyone explain how the deconvolution works? I'm guessing it's not a simple interpolation.

Solution

The deconvolution network undoes the two main operations of the convolutional network: max pooling and convolution.

The pooling layers in the convolutional neural network downsample the image by (usually) taking the maximum value within each receptive field: every r×r image region is reduced to a single value. What this implementation does is also store which unit had the maximum activation in each of these pooling steps. Then, at each "unpooling" layer in the deconvolution network, the r×r region is reconstructed by propagating the activation only to the location that produced the original max-pooled value, leaving the remaining positions at zero.

Thus, "the output of an unpooling layer is an enlarged, yet sparse activation map."
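This switch-based unpooling can be sketched in a few lines of NumPy. This is a minimal single-channel illustration, not the paper's actual implementation, and the function names are mine:

```python
import numpy as np

def max_pool_with_indices(x, r=2):
    """r x r max pooling that also records where each maximum came from."""
    h, w = x.shape
    pooled = np.zeros((h // r, w // r))
    indices = np.zeros((h // r, w // r, 2), dtype=int)
    for i in range(h // r):
        for j in range(w // r):
            window = x[i * r:(i + 1) * r, j * r:(j + 1) * r]
            k = np.unravel_index(np.argmax(window), window.shape)
            pooled[i, j] = window[k]
            indices[i, j] = (i * r + k[0], j * r + k[1])
    return pooled, indices

def max_unpool(pooled, indices, out_shape):
    """Place each pooled value back at its recorded location; the rest stays zero."""
    out = np.zeros(out_shape)
    h, w = pooled.shape
    for i in range(h):
        for j in range(w):
            out[tuple(indices[i, j])] = pooled[i, j]
    return out
```

Unpooling a 4×4 map that was pooled to 2×2 produces a 4×4 output again, but with only four nonzero entries, which is exactly the "enlarged, yet sparse activation map" the paper describes.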

Convolutional layers learn filters that map each r×r input region to a single value, where r is the receptive field size. The point of the deconvolutional layer is to learn the opposite mapping: its filter is a set of weights that projects a single input activation into an s×s output region (with overlapping projections summed), where s is the filter size. These filters are learned by backpropagation, just as in regular convolutional layers.
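This "one activation scatters to many outputs" operation is commonly called a transposed convolution. A minimal single-channel NumPy sketch (my own illustration, not the paper's code) makes the scatter-and-sum explicit:

```python
import numpy as np

def deconv2d(x, w, stride=2):
    """Transposed convolution: each input activation scatters a weighted
    copy of the k x k filter w into the (larger) output map; overlaps sum."""
    h_in, w_in = x.shape
    k = w.shape[0]
    h_out = (h_in - 1) * stride + k
    w_out = (w_in - 1) * stride + k
    out = np.zeros((h_out, w_out))
    for i in range(h_in):
        for j in range(w_in):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out
```

Applied to a sparse unpooled map, the overlapping filter copies fill in the zeros, which is how the deconvolution layers "densify the sparse activations". In practice the weights w are learned, and frameworks crop the enlarged output back to the unpooling layer's size, as the paper notes.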

As the mirror image of a deep CNN, the deconvolution network starts from high-level, class-specific features. Each successive layer then localizes them, enhancing class-specific activations while suppressing noise, until a full-resolution probability map is produced.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange