Question

At the end of the introduction to this instructive Kaggle competition, they state that the method from Viola and Jones' seminal paper "works quite well". However, that paper describes a system for binary facial recognition, and the problem being addressed here is the localization of keypoints, not the classification of entire images. I am having a hard time figuring out how, exactly, I would go about adapting the Viola/Jones system for keypoint recognition.

I assume I should train a separate classifier for each keypoint, and some ideas I have are:

  • iterate over fixed-size sub-images and classify each one, treating a sub-image whose center pixel is a keypoint as a positive example. In this case I'm not sure what to do with pixels close to the edge of the image.

  • instead of training binary classifiers, train classifiers with l*w possible classes (one per pixel). The big problem is that I suspect this will be prohibitively slow, since every weak classifier suddenly has to do l*w times the original number of operations

  • the third idea isn't totally hashed out in my mind, but since each keypoint is part of a larger face part (the left corner, right corner, and center of an eye, for example), maybe I could classify sub-images as just "an eye", and then take the left, right, and center pixels (centered in the y coordinate) of the best-fit sub-image for each face part
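The first idea above can be sketched concretely. This is a hypothetical illustration, not code from the competition: `classify` stands in for any trained binary patch classifier, and the edge-pixel problem is handled here by reflect-padding the image so every center pixel gets a full window.

```python
import numpy as np

def sliding_window_keypoint(image, classify, win=24):
    """Score every pixel as a candidate keypoint by classifying the
    fixed-size window centered on it (idea #1 above).

    `classify` maps a (win, win) patch to a keypoint score. Reflect
    padding gives edge pixels a full-size window.
    """
    half = win // 2
    padded = np.pad(image, half, mode="reflect")  # handles edge pixels
    h, w = image.shape
    scores = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + win, x:x + win]
            scores[y, x] = classify(patch)
    # the predicted keypoint is the highest-scoring center pixel
    return np.unravel_index(np.argmax(scores), scores.shape)
```

Run once per keypoint-specific classifier; the cost is one classifier evaluation per pixel, which is exactly why the l*w-class variant below worries about speed.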

Is there any merit to these ideas, and are there methods I haven't thought of?


Solution 2

I ended up working on this problem extensively. I used "deep learning", i.e. neural networks with several layers; specifically, convolutional networks. You can learn more about them by checking out these demos:

http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

http://deeplearning.net/tutorial/lenet.html#lenet

I made the following changes to a typical convolutional network:

  • I did not do any down-sampling, since any loss of spatial precision translates directly into a lower model score

  • I did per-pixel binary classification, with each pixel classified as a keypoint or non-keypoint (#2 in the list in my original post). As I suspected, computational complexity was the primary barrier here. I tried to use my GPU to overcome this, but the number of parameters in the neural network was too large to fit in GPU memory, so I ended up training on an Amazon XL instance.
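The two bullets above combine naturally in a fully convolutional scorer: keep the output at full resolution (no down-sampling) and produce one score per pixel, so a softmax over all pixels is exactly the l*w-way classification from the original post. Here is a minimal one-layer numpy sketch of that idea; the kernel and image are placeholders, not the actual trained network.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Cross-correlation with reflect padding so the output keeps the
    input size, i.e. no down-sampling (kernel dimensions assumed odd)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2,) * 2, (kw // 2,) * 2), mode="reflect")
    h, w = image.shape
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = np.sum(padded[y:y + kh, x:x + kw] * kernel)
    return out

def keypoint_heatmap(image, kernel):
    """One-layer fully convolutional scorer: the conv output is an h*w
    score map, and a softmax over *all* pixels turns it into the
    l*w-class problem described above (one probability per pixel)."""
    scores = conv2d_same(image, kernel)
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()
```

A real network would stack several such layers with learned kernels, but the key point survives in the sketch: sharing convolution work across pixels avoids paying l*w times the cost of a single patch classifier.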

Here's a github repo with some of the work I did: https://github.com/cowpig/deep_keypoints

Anyway, given how much deep learning has grown in popularity, there are surely people who have done this much better than I did and published papers about it. Here's a write-up that looks pretty good:

http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/

OTHER TIPS

however, that paper describes a system for binary facial recognition

No: read the paper carefully. What they describe is not face-specific; face detection was just the motivating problem. The Viola-Jones paper introduced a general strategy for binary object detection.

You could train one Viola-Jones-style cascade for eyes, another for the nose, and so on for each keypoint you are interested in.

Then, when you run the detectors, you should (hopefully) get 2 eyes, 1 nose, etc. for each face.

Provided you get the number of detections you expected, you can then say "here are the keypoints!" The harder part is getting enough data to build a good detector for each part you want to detect, and gracefully handling false positives and negatives.
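The count-checking step above can be sketched as a small post-processing function. Everything here is an assumption for illustration: the expected counts, the (x, y, w, h) box format the per-part cascades would return, and the choice of box centers as the final keypoints.

```python
def boxes_to_keypoints(detections, expected=None):
    """Turn per-part cascade detections into keypoints.

    `detections` maps a part name to a list of (x, y, w, h) bounding
    boxes. If every part was detected the expected number of times,
    return the center of each box as that part's keypoint(s); otherwise
    return None so the caller can fall back or re-run with different
    cascade parameters (a false positive/negative slipped through).
    """
    if expected is None:
        expected = {"eye": 2, "nose": 1}  # assumed per-face counts
    for part, count in expected.items():
        if len(detections.get(part, [])) != count:
            return None
    return {
        part: [(x + w / 2.0, y + h / 2.0) for (x, y, w, h) in boxes]
        for part, boxes in detections.items()
    }
```

In practice the box lists would come from something like OpenCV's cascade detectors run once per trained part model; the function above only captures the "did I get the counts I expected?" gate described in the answer.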

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow