Problem

I want to use CNN transfer learning to track a tennis ball in TV broadcasts of tennis matches. I used the VGG image annotation tool (annotation tool link; use version 1 of the tool for compatibility with the Matterport code) and have about 200 frames annotated with the ball location, with the x, y coordinates of the bounding circle given by the tool.

Like this:

[Screenshot: VGG annotation tool on a tennis video frame]

However, the ball is sometimes occluded by the bottom player's body or the net tape, and at other times it is practically invisible because it is moving too fast (or it appears elliptical, smeared in the direction it is moving).
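To turn those ~200 annotated frames into training targets, the VIA export has to be parsed into per-frame (x, y) ball centres. A minimal sketch, assuming the version-1 JSON export where a circle region stores its centre and radius as cx, cy, r inside shape_attributes (key names may differ in your export, and the filename below is hypothetical):

```python
import json

def load_ball_annotations(via_json_path):
    """Parse a VIA (v1) JSON export into {filename: (cx, cy)} for circle regions."""
    with open(via_json_path) as f:
        via = json.load(f)

    ball_centers = {}
    for entry in via.values():                      # one entry per annotated image
        filename = entry["filename"]
        for region in entry.get("regions", {}).values():
            shape = region["shape_attributes"]
            if shape.get("name") == "circle":       # centre of the bounding circle
                ball_centers[filename] = (shape["cx"], shape["cy"])
    return ball_centers

# centers = load_ball_annotations("via_export.json")   # hypothetical export filename
```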

A potential solution I saw is an algorithm called TrackNet (TrackNet: Tracknet ball tracking). I contacted its creators and was told that they would open-source it, but it has been over a year, so I want to try to mimic it.

Edit: My Q&A via email with the TrackNet team:

[Screenshot: my question to the TrackNet team]

No.

We concatenate three images, which are 9 slices in total, and then input them together into the network, keeping the rest of the network the same. With this method, the computational overhead is only in the first layer.

If you would like to have a design like the figure in your previous email, we would suggest applying an RNN or LSTM to track the ball.

Best, Tsi-Ui

"9 slices in total"...does this mean RGB for each of 3 consecutive frames?

Also, I know from Karpathy's Pong post how to feed difference frames, but how would I feed 3 consecutive frames as done in TrackNet?

Having manually gone through the frames, I know that 2 frames are not enough; 3 seems to be the minimum required to have at least one visible tennis ball.
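If "9 slices" does mean the RGB channels of 3 consecutive frames, then feeding them is just a channel-wise concatenation: the network input has 9 channels instead of 3, and only the first convolutional layer has to change to accept 9 input channels (which matches their remark that the overhead is only in the first layer). A minimal numpy sketch, with the frame size and variable names purely illustrative:

```python
import numpy as np

def stack_consecutive_frames(frames, t):
    """Concatenate the RGB frames at t, t+1, t+2 along the channel axis.

    frames: sequence of RGB frames, each of shape (H, W, 3)
    returns: array of shape (H, W, 9) -- the "9 slices"
    """
    return np.concatenate([frames[t], frames[t + 1], frames[t + 2]], axis=-1)

# Example: three dummy 720p frames -> one (720, 1280, 9) network input
frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(3)]
x = stack_consecutive_frames(frames, t=0)
print(x.shape)  # (720, 1280, 9)
```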

Furthermore, I was advised by Adrian Rosebrock of PyImageSearch that I would need trajectory estimation and a high-FPS camera, so this is another avenue for investigation, although TrackNet seemed to manage with neither of these.

Edit: I am reading the Deep Learning Book chapters on CNNs to learn more about how they process input at a low level, so I can figure out what "concatenate 9 slices" means. Book chapter: CNN chapter
Just as a thought, I was also considering computing a collection of difference frames frame(t) - frame(t-1), frame(t) - frame(t-2), ..., frame(t) - frame(t-n), which might help in approximating the ball location.
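As a sketch of that idea (the value of n, the grayscale conversion, and the subtraction order are my assumptions, not something from TrackNet), the difference stack at time t could be built like this:

```python
import numpy as np

def difference_stack(frames, t, n):
    """Stack frame[t] - frame[t-1], ..., frame[t] - frame[t-n] along the channel axis.

    frames: sequence of grayscale frames, each of shape (H, W)
    returns: array of shape (H, W, n)
    """
    current = frames[t].astype(np.int16)            # signed to avoid uint8 wrap-around
    diffs = [current - frames[t - k].astype(np.int16) for k in range(1, n + 1)]
    return np.stack(diffs, axis=-1)

# Example with 5 dummy frames and n = 3
frames = [np.random.randint(0, 256, (720, 1280), dtype=np.uint8) for _ in range(5)]
d = difference_stack(frames, t=4, n=3)
print(d.shape)  # (720, 1280, 3)
```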

Edit: I just saw this video by Andrew Ng on outputting real numbers from a neural network for object detection and localization: coursera video. So, if that is possible, then I can output y = [x_0, y_0, x_1, y_1, x_2, y_2], the x, y coordinates of the center of the ball at t, t+1, t+2, and take the mean squared error between y-hat and y as the loss.

Note: when feeding the network, I might use overlapping windows, i.e. feed [t, t+1, t+2] and then [t+1, t+2, t+3], which share the frames t+1 and t+2.

Also, for frames in which the ball is too blurry to see, I am going to train a separate network that takes the ball locations at t_0 and t+n, concatenated, as input and outputs the ball location at t_j, where j is somewhere between 0 and n. I will then use these outputs to annotate the "invisible ball" frames, and use this "complete" set of ground-truth frames, concatenated, as the training set for the main network that tracks the ball in video.
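To make the coordinate-regression idea concrete, here is a minimal Keras sketch; it is not TrackNet's architecture, and the layer sizes, input resolution, and choice of Keras are my assumptions. A small CNN takes the 9-channel stack and regresses the 6 numbers [x_t, y_t, x_{t+1}, y_{t+1}, x_{t+2}, y_{t+2}] with an MSE loss, and overlapping windows [t, t+1, t+2], [t+1, t+2, t+3], ... are generated with a stride-1 slide:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_ball_regressor(input_shape=(360, 640, 9)):
    """Small CNN mapping a 3-frame (9-channel) stack to the 6 ball-centre coordinates."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, strides=2, activation="relu")(inputs)
    x = layers.Conv2D(64, 3, strides=2, activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(6)(x)                  # [x_t, y_t, x_t+1, y_t+1, x_t+2, y_t+2]
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")   # MSE between y-hat and y
    return model

def sliding_windows(frames, coords):
    """Yield overlapping (9-channel stack, 6-vector target) pairs:
    [t, t+1, t+2], then [t+1, t+2, t+3], and so on."""
    for t in range(len(frames) - 2):
        x = np.concatenate(frames[t:t + 3], axis=-1)   # (H, W, 9)
        y = np.concatenate(coords[t:t + 3], axis=-1)   # (6,) ball centres at t, t+1, t+2
        yield x, y
```

Training would then just be `model.fit` on arrays built from `sliding_windows`; the separate "in-between location" network for blurry frames could reuse the same MSE setup, with the two known locations as its 4-number input and the interpolated location as its 2-number output.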

No correct solution
