Keypoint recognition as classification?

Question 1

I ended up working on this problem extensively. I used "deep learning," aka several layers of neural networks. I used convolutional networks. You can learn more about them by checking out these demos:

http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html

http://deeplearning.net/tutorial/lenet.html#lenet

I made the following changes to a typical convolutional network:

I did not do any down-sampling, as any loss of precision directly translates to a decrease in the model's score
I did n-way binary classification, with each pixel being classified as a keypoint or non-keypoint (#2 in the things I listed in my original post). As I suspected, computational complexity was the primary barrier here. I tried to use my GPU to overcome these issues, but the number of parameters in the neural network were too large to fit in GPU memory, so I ended up using an xl amazon instance for training.

Here's a github repo with some of the work I did: https://github.com/cowpig/deep_keypoints

Anyway, given that deep learning has blown up in popularity, there are surely people who have done this much better than I did, and published papers about it. Here's a write-up that looks pretty good:

http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/

Question 2

however, that paper describes a system for binary facial recognition

No, read the paper carefully. What they describe is not face specific, face detection was the motivating problem. The Viola Jones paper introduced a new strategy for binary object recognition.

You could train a Viola Jones style Cascade for eyes, another for a nose, and one for each keypoint you are interested in.

Then, when you run the code - you should (hopefully) get 2 eyes, 1 nose, etc, for each face.

Provided you get the number of items you expected, you can then say "here are the key points!" What takes more work is getting enough data to build a good detector for each thing you want to detect, and gracefully handling false positives / negatives.