Why are RNNs used in some computer vision problems?

https://datascience.stackexchange.com/questions/77242

12-12-2020
|

문제

I am learning computer vision. When I was going through implementations of various computer vision projects, some OCR problems used GRU or LSTM, while some did not. I understand that RNNs are used only in problems where input data is a sequence, like audio or text.

So, in kernels of MNIST on kaggle almost no kernel has used RNNs and almost every repository for OCR on IAM dataset on GitHub has used GRU or LSTMs. Intuitively, written text in an image is a sequence, so RNNs were used. But, so is the written text in MNIST data. So, when exactly is it that RNNs(or GRUs or LSTMs) need to be used in computer vision and when don't?

해결책

RNNs and CNNs are not mutually exclusive! It might seem that they are used to handle different problems, but it is important to note that some types of data can be processed by either architecture. For instance, RNNs uses the sequences as the input. It should be mentioned that sequences are not just limited to text or music. Sequences can also be videos, which are a set of images.

RNNs, such as LSTM, are used for cases where the data includs temporal properties, e.g., time series, and also where the data is context-sensitive, e.g., in the case of sentence completion, the function of memory provided by the feedback loops is critical for adequate performance. In addition, RNNs have been successfully applied in the following areas of computer vision:

Image classification (one-to-one RNN): e.g., “Daytime picture” versus “Nighttime picture”.
Image captioning (One-to-many RNN): giving a caption to an image based on what is being shown. For example, “Fox jumping over dog”.
Handwriting recognition: Please read this [paper] (https://arxiv.org/pdf/1902.10525.pdf)

Regarding CNN, here are some of its applications:

Medical image analysis
Image recognition
Face detection
Recognition systems
Full-motion video analysis. It is important to know that CNNs are not capable of handling a variable-length input.

Finally, using RNNs and CNNs together is possible and it could be the most advanced use of computer vision. For example, a hybrid RNN and CNN approach may be superior when the data is suitable for a CNN, but has temporal characteristics.

Ref

다른 팁

When I think about RNNs applied to Computer Vision, two main research areas of Deep Learning come up to my mind:

Image Captioning: Neural Networks trained to produce descriptions of images. In that case, you have a Conv ecoder that processed pixel data, and an RNN decoder that produces a description.
Video processing (I don't have a better term). Anything regarding video data requires a mixture of CNNs and RNNs. In general: you process each frame with Conv layers, and the sequence of processed image frames is then fed into some RNN layer that performs the final task, that could be classification, segmentation, self driving cars, ... you name it.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 datascience.stackexchange