Question

On our current project we are using a depth camera mounted on top of the user's head to recognize fingers, hands, and touch events. This already works quite well and can be used as a new type of input device.

Our next step is to use augmented reality glasses to display buttons/controls on the user's palm. For this we need a transformation of our recognized data (fingertip, corner points of the palm quadrangle) so that we can display it at the correct location on the augmented reality glasses. In the future we will use a real 3D output scene, but for now we simply display a 2D image with the glasses. You can imagine the whole setup as a stereo view, with the depth camera and the user's eyes as the cameras.

To get the transformation matrix, we successively display a random point on the output image and the user has to hold their fingertip at that location. This gives us point correspondences between the input image (depth camera) and the output image (augmented reality glasses). We currently collect 20 of these correspondences and then use Emgu's FindHomography() method to get the transformation matrix.
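For reference, our calibration is roughly the following, sketched here in Python with OpenCV (Emgu's FindHomography() wraps the same OpenCV routine); the point arrays are synthetic placeholders rather than our real measurements:

```python
import numpy as np
import cv2

# 20 fingertip pixels as seen by the depth camera (320x240). In practice these
# come from our hand tracker; here they are random placeholders.
depth_pts = np.random.uniform([0, 0], [320, 240], size=(20, 2)).astype(np.float32)

# Corresponding pixels where the calibration dots were shown on the glasses.
# Fabricated from a made-up ground-truth warp just so the sketch runs.
H_true = np.array([[2.0, 0.1, 30.0],
                   [0.05, 2.1, 10.0],
                   [1e-4, 0.0, 1.0]])
ones = np.ones((20, 1), dtype=np.float32)
proj = np.hstack([depth_pts, ones]) @ H_true.T
display_pts = (proj[:, :2] / proj[:, 2:]).astype(np.float32)

# Estimate the homography; RANSAC (3 px reprojection threshold) discards bad
# touches that plain least squares would not.
H, inliers = cv2.findHomography(depth_pts, display_pts, cv2.RANSAC, 3.0)

def map_depth_pixel(H, px, py):
    """Map one depth-camera pixel into the glasses image via the homography."""
    v = H @ np.array([px, py, 1.0])
    return v[:2] / v[2]

print(map_depth_pixel(H, 160, 120))
```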

For our first effort this already works ok, but it's not perfect. How should we proceed to get better results?

What we have:

  • 2D pixel coordinates in our input image (depth camera 320x240)
  • 3D coordinates (relative to our depth camera)
  • (corresponding 2D pixel coordinates in output image)

What we need:
A method that maps a 2D pixel coordinate or a 3D coordinate relative to our depth camera to our output image (2D for now, maybe 3D later).

Question:
What type of transformation should we use here? FindHomography(), GetPerspectiveTransformation(), the fundamental matrix, the essential matrix?

Any help/suggestion is greatly appreciated. Thank you in advance!


Solution

First, FindHomography() and GetPerspectiveTransformation() estimate the same kind of transformation (a plane-to-plane homography); the difference is that the former can run the estimation repeatedly inside a RANSAC framework to reject outliers. A homography maps points between planes and thus isn't suitable for your 3D task. The fundamental and essential matrices aren't point transformations either: they encode the epipolar geometry between two views and map a point in one image to a line in the other, so they won't give you the mapping you need. If you want to re-project a virtual object from the camera's coordinate system into the glasses' point of view, you simply have to apply a rotation and translation in 3D to your object coordinates and then re-project them.
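In code that boils down to a rigid-body transform followed by a perspective divide. A minimal sketch in Python/NumPy, assuming a pinhole model without lens distortion and the convention that R, t express the glasses' pose in the depth-camera frame (both assumptions, not details from your setup):

```python
import numpy as np

def reproject(X_cam, R, t, f, w, h):
    """Map a 3D point from depth-camera coordinates into a 2D pixel of one
    glasses viewpoint. R, t give that viewpoint's pose in the camera frame
    (X_cam = R @ X_view + t), f is the focal length in pixels, and (w, h)
    is the size of the glasses image."""
    X_view = R.T @ (X_cam - t)            # rotation + translation in 3D
    x, y, z = X_view
    return np.array([x * f / z + w / 2,   # column
                     h / 2 - y * f / z])  # row
```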

The sequence of steps is:
1. find the 3D coordinates of the landmark (say, your hand) using the stereo/depth camera;
2. place your control (e.g. a virtual button) close to the landmark in 3D space;
3. calculate the relative translation and rotation of each of your goggles' viewpoints w.r.t. the stereo camera; for example, you may find that the right goggle's focal point sits 3 cm to the right of the stereo camera and is rotated 10 degrees around the y axis. Importantly, the left and right focal points are shifted in space, which creates an image disparity during re-projection (the greater the depth, the smaller the disparity) that your brain interprets as a stereo cue for depth. Note that there are plenty of other depth cues (blur, perspective, known object sizes, vergence of the eyes, etc.) that may or may not be consistent with the disparity cue.
4. apply the inverse of the viewpoint transformation to the virtual 3D object; for example, if the goggles move to the right (w.r.t. the stereo camera), it is as if the object moved to the left;
5. project the resulting 3D coordinates into the image as col = x*f/z + w/2 and row = h/2 - y*f/z, where f is the focal length in pixels and w, h are the output image dimensions; using OpenGL can help make the projection look nicer.
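Putting the five steps together, here is a small end-to-end sketch in the same vein (the focal length, image size, eye offsets, and the 10-degree rotations are made-up illustrative numbers, not measurements from your rig):

```python
import numpy as np

F = 500.0          # focal length of the glasses display, in pixels (assumed)
W, H = 640, 480    # output image size of the glasses (assumed)

def project(X_view, f=F, w=W, h=H):
    """Pinhole projection of a 3D point given in a viewpoint's own frame."""
    x, y, z = X_view
    return np.array([x * f / z + w / 2, h / 2 - y * f / z])   # (col, row)

def rot_y(deg):
    """Rotation matrix around the y axis."""
    a = np.radians(deg)
    return np.array([[ np.cos(a), 0.0, np.sin(a)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

# Step 1: 3D landmark from the depth camera, e.g. the palm centre 40 cm ahead.
palm = np.array([0.00, -0.05, 0.40])            # metres, depth-camera frame

# Step 2: place a virtual button 2 cm above the palm.
button = palm + np.array([0.0, 0.02, 0.0])

# Step 3: pose of each goggle viewpoint w.r.t. the depth camera, in the
# convention X_cam = R @ X_eye + t. The numbers are made up.
eyes = {
    "left":  (rot_y(-10), np.array([-0.03, -0.08, -0.02])),
    "right": (rot_y(+10), np.array([+0.03, -0.08, -0.02])),
}

pixels = {}
for name, (R, t) in eyes.items():
    # Step 4: inverse viewpoint transform brings the button into eye coordinates.
    button_eye = R.T @ (button - t)
    # Step 5: perspective projection into the glasses image.
    pixels[name] = project(button_eye)
    print(name, pixels[name])

# The horizontal difference between the two projections is the disparity
# that acts as the stereo depth cue mentioned in step 3.
print("disparity (px):", pixels["left"][0] - pixels["right"][0])
```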

Licensed under: CC-BY-SA with attribution