Problem

I've got a Faster R-CNN (ResNet-101 backbone) for object detection, and I'm extracting a feature tensor for each detected object: a 7x7x2048 tensor (basically 2048 feature maps, each 7x7). For object tracking, I want to turn this into an Nx1 vector. What is the standard way to do this? I have a few ideas that all seem reasonable (each is sketched in code after the list):

  • Flatten each feature map and concatenate the results (so each feature vector would be 49*2048 x 1)?
  • Do the same, but after a max pooling operation to reduce dimensionality first.
  • Take the mean or max of each feature map, and end up with a 2048x1 feature vector.
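
To make the three options concrete, here is a minimal PyTorch sketch (the variable names are just for illustration, and I'm assuming a channels-first 2048x7x7 layout per detection):

```python
import torch

# One detection's feature tensor: 2048 feature maps, each 7x7 (channels-first layout).
feat = torch.randn(2048, 7, 7)

# Option 1: flatten everything into one long vector -> shape (100352,) = 49*2048.
v1 = feat.reshape(-1)

# Option 2: 2x2 max pool first (7x7 -> 3x3), then flatten -> shape (18432,).
pooled = torch.nn.functional.max_pool2d(feat.unsqueeze(0), kernel_size=2, stride=2)
v2 = pooled.reshape(-1)

# Option 3: collapse each 7x7 map to one number -> shape (2048,).
v3_mean = feat.mean(dim=(1, 2))
v3_max = feat.amax(dim=(1, 2))
```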

Solution

After asking around about this, it seems the third option is the standard: take the mean of each feature map, and create a 2048-element feature vector. The search term for this is global pooling; that was the terminology I was missing.

Global average pooling is good because it reduces the dimensionality before classification. Also, by the time you have reached this level of abstraction in your feature extractor, you are probably not that interested in the finer-grained spatial aspects of the features: after a great deal of pooling of the original image, you are already representing a very abstract "object" (depending on how deep your feature extractor is).
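
As a rough sketch of how this could plug into a tracking pipeline (hypothetical names; again assuming PyTorch and a channels-first 2048x7x7 tensor per detection), you might global-average-pool each detection to a 2048-vector, L2-normalize it, and match detections across frames by cosine similarity:

```python
import torch
import torch.nn.functional as F

def embed(feat: torch.Tensor) -> torch.Tensor:
    """Global average pooling: (2048, 7, 7) -> L2-normalized (2048,) descriptor."""
    v = feat.mean(dim=(1, 2))      # average each 7x7 feature map
    return F.normalize(v, dim=0)   # unit norm, so a dot product is cosine similarity

# Toy stand-ins for one detection in frame t and five candidates in frame t+1.
prev = embed(torch.randn(2048, 7, 7))
candidates = torch.stack([embed(torch.randn(2048, 7, 7)) for _ in range(5)])

# Cosine similarity to each candidate; the argmax is a plausible match for tracking.
sims = candidates @ prev           # shape (5,)
best = sims.argmax().item()
```

The normalization step is optional, but it makes the dot product directly interpretable as a cosine similarity.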

That said, if you are worried about losing spatial information, you can always try one of the other two options. I'd be curious to hear from anyone who has tried this about how they went about it.

Also, for a really nice summary from a paper that addresses this topic, see this answer on Cross Validated:
https://stats.stackexchange.com/a/308218/17624

In particular:

Conventional convolutional neural networks perform convolution in the lower layers of the network. For classification, the feature maps of the last convolutional layer are vectorized and fed into fully connected layers followed by a softmax logistic regression layer. This structure bridges the convolutional structure with traditional neural network classifiers. It treats the convolutional layers as feature extractors, and the resulting feature is classified in a traditional way.

However, the fully connected layers are prone to overfitting, thus hampering the generalization ability of the overall network. Dropout is proposed by Hinton et al. as a regularizer which randomly sets half of the activations in the fully connected layers to zero during training. It has improved the generalization ability and largely prevents overfitting.

In this paper, we propose another strategy called global average pooling to replace the traditional fully connected layers in CNN. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer. One advantage of global average pooling over the fully connected layers is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling, thus overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input. We can see global average pooling as a structural regularizer that explicitly enforces feature maps to be confidence maps of concepts (categories). This is made possible by the mlpconv layers, as they make better approximations to the confidence maps than GLMs.

Which is from this paper: https://arxiv.org/pdf/1312.4400.pdf
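
For completeness, here is a toy sketch of the head the quote describes (this only illustrates the idea; it uses a plain 1x1 convolution rather than the paper's mlpconv layers): one feature map is produced per class, global average pooling collapses each map to a scalar, and the resulting vector goes straight into softmax:

```python
import torch
import torch.nn as nn

num_classes = 10  # placeholder

head = nn.Sequential(
    # Stand-in for the paper's final mlpconv output: one 7x7 map per class.
    nn.Conv2d(2048, num_classes, kernel_size=1),
    # Global average pooling: each class map collapses to a single score.
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                  # (N, num_classes, 1, 1) -> (N, num_classes)
)

x = torch.randn(4, 2048, 7, 7)     # batch of 4 feature tensors
logits = head(x)                   # (4, num_classes)
probs = logits.softmax(dim=1)      # fed "directly into the softmax layer"
```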
