Appropriateness of an artificial neural network in pose estimation

Question

A neural network may well work for your application, but I suspect you will need a lot of training samples for the network to generalize. Of course, this also depends on the type and number of poses you're dealing with. It sounds to me as if, with some clever maths, it might be possible to derive the movement vector directly from the input vector. If you can come up with a way of doing that (or provide more information so others can think about it too), that would be much preferable, because you would then build in the prior knowledge you have about the task instead of relying on the NN to learn it from data.

If you decide to go ahead with the NN approach, keep the following in mind:

Divide your data into a training set and a validation set. This lets you check that the network doesn't overfit: you train on the training set and judge the quality of a particular network by its error on the validation set. The training/validation ratio depends on how much data you have. A large validation set (e.g., 50% of your data) allows more precise conclusions about the quality of the trained network, but often you have too little data to afford this. In any case, I would suggest using at least 10% of your data for validation.
As to the number of hidden units, a rule of thumb is to have at least 10 training examples for each free parameter, i.e., each weight. So assuming you have a 3-layer network with 4 inputs, 10 hidden units, and 3 output units, where each hidden unit and each output unit additionally has a bias weight, you would have (4+1) * 10 + (10+1) * 3 = 83 free parameters/weights, and by this rule you would want at least 830 training examples. In general you should experiment with the number of hidden units and also the number of hidden layers. In my experience 4-layer networks (i.e., 2 hidden layers) work better than 3-layer networks, but that depends on the problem. Since you also have the validation set, you can find out which network architecture and size work without having to fear overfitting.
For the activation function, use some sigmoid function to allow for non-linear behavior. I like the hyperbolic tangent for its symmetry, but in my experience the logistic function works just as well.
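
To make the points above concrete, here is a minimal Python sketch using only the standard library. It is not tied to any particular NN toolkit, and the helper names (count_parameters, split_train_validation, logistic) as well as the placeholder data are my own assumptions for illustration. It counts the free parameters of the example network, applies the 10-examples-per-weight rule of thumb, holds out 10% of the data for validation, and checks that tanh is just a shifted and rescaled logistic function, which is one reason the two tend to behave similarly.

```python
import math
import random


def count_parameters(layer_sizes):
    """Count weights plus one bias per non-input unit in a fully connected net."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def split_train_validation(samples, validation_fraction=0.1, seed=0):
    """Shuffle a copy of the samples and hold out a fraction for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def logistic(x):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# The 3-layer example from the text: 4 inputs, 10 hidden units, 3 outputs.
n_params = count_parameters([4, 10, 3])

# Rule of thumb from the text: at least 10 training examples per weight.
min_examples = 10 * n_params

# Placeholder data (an assumption): integers standing in for
# (input vector, movement vector) pairs.
samples = list(range(min_examples))
train, validation = split_train_validation(samples, validation_fraction=0.1)

# tanh(x) equals 2 * logistic(2x) - 1, i.e. a logistic curve rescaled
# to the symmetric output range (-1, 1).
x = 0.7
tanh_gap = abs(math.tanh(x) - (2.0 * logistic(2.0 * x) - 1.0))

print(n_params, min_examples, len(train), len(validation))
```

Passing a different layer list, e.g. [4, 10, 10, 3] for two hidden layers, gives the parameter count for the 4-layer variant, so you can see how quickly the data requirement grows as you experiment with the architecture.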