Question

I want to ask a question about action detection in video with proposed frames (regions of interest). I've used a Temporal 3D ConvNet for action recognition on video, trained it successfully, and it can recognize actions in videos.

When I do inference, I just collect 20 frames from the video, feed them to the model, and it gives me a result. The problem is that events in different videos are not of similar size: some cover 90% of the frame, while others may cover only 10%. Take as an example two objects colliding: the collision can happen at different scales, and I want to detect this action.
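
For reference, here is roughly what my inference pipeline looks like (a simplified sketch; the model loading is omitted and the clip length and input size are placeholder values from my training setup):

```python
import cv2
import numpy as np
import torch

NUM_FRAMES = 20    # clip length the model was trained on
INPUT_SIZE = 112   # spatial input resolution (placeholder value)

def sample_clip(video_path, num_frames=NUM_FRAMES):
    """Uniformly sample num_frames frames across the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (INPUT_SIZE, INPUT_SIZE))
        frames.append(frame[:, :, ::-1])   # BGR -> RGB
    cap.release()
    return np.stack(frames)                # (T, H, W, 3)

# model = my trained Temporal 3D ConvNet (loading omitted)
clip = sample_clip("video.mp4")
x = torch.from_numpy(clip).float().permute(3, 0, 1, 2).unsqueeze(0) / 255.0
with torch.no_grad():
    logits = model(x)                      # (1, num_classes)
pred = logits.argmax(dim=1).item()
```

As you can see, the whole frame gets resized to the network input, so a small event ends up occupying only a few pixels of what the network actually sees.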

How can I provide the model with the exact position of the action for recognition, if the action can happen at different scales with different objects? What comes to mind is to use YOLO to collect regions of interest and feed the cropped frames to the 3D ConvNet each time (roughly as in the sketch below). But if there are many objects, this will be very slow. How can I handle that? Are there any end-to-end solutions for action recognition with object-location proposals feeding the recognition network? I've already looked at papers and blog posts to see what people suggest, but I couldn't find a solution to the localization issue that would give the action recognition model the right frames.
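
To make the idea concrete, here is a sketch of the crop-and-classify pipeline I have in mind. The `detector` call is a placeholder for whatever YOLO wrapper is used; the one optimization shown is running the detector on a single keyframe per clip and batching all ROI clips into one forward pass, instead of calling the 3D ConvNet once per object:

```python
import cv2
import numpy as np
import torch

def crop_clip(clip, box, size=112, pad=0.15):
    """Crop one padded ROI out of every frame of a (T, H, W, 3) clip.

    Padding keeps some spatial context around the detected objects,
    which matters for interactions like collisions.
    """
    T, H, W, _ = clip.shape
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    x1 = max(int(x1 - pad * w), 0); y1 = max(int(y1 - pad * h), 0)
    x2 = min(int(x2 + pad * w), W); y2 = min(int(y2 + pad * h), H)
    roi = [cv2.resize(f[y1:y2, x1:x2], (size, size)) for f in clip]
    return np.stack(roi)                   # (T, size, size, 3)

# clip: frames sampled as in the previous sketch.
# Run the detector once on the middle frame, not on every frame.
boxes = detector(clip[len(clip) // 2])     # placeholder: list of (x1, y1, x2, y2)
crops = np.stack([crop_clip(clip, b) for b in boxes])
x = torch.from_numpy(crops).float().permute(0, 4, 1, 2, 3) / 255.0
with torch.no_grad():
    logits = model(x)                      # one batched forward pass for all ROIs
```

Even batched like this, the cost still grows linearly with the number of detections, which is why I'm asking about end-to-end alternatives.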

Any advice? Maybe someone can explain an approach to me?

Thank you

Regards, Dmitry
