I have worked on a similar dynamic hand gesture recognition project (although using a simple webcam rather than a Kinect). In my case, I categorized my gestures into classes such as Left, Right, Circular-Clockwise, Circular-AntiClockwise, etc. Since you are taking the angles between consecutive points into account, those angles form your observation sequence. As for the states, there need not be a direct logical relation between your states and your observations. I was working with 8 gestures. I had about 12 observation symbols for each input pattern, but the number of states differed from class to class, for example:

Left: 2 states
Right: 3 states
Circle-Clockwise: 4 states

and so on.
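As a minimal sketch of how such an angle-based observation sequence might be extracted (the function name, the 2-D centroid input, and the choice of 12 equal angular bins are my own assumptions, mirroring my setup rather than any standard API):

```python
import numpy as np

def quantize_angles(points, n_symbols=12):
    """Turn a trajectory of (x, y) positions into discrete observation symbols.

    points: (N, 2) array-like of hand positions, one per frame (assumed input)
    n_symbols: number of equal angular bins around the circle
    Returns an (N-1,) array of symbol indices in [0, n_symbols).
    """
    pts = np.asarray(points, dtype=float)
    d = np.diff(pts, axis=0)                   # displacement between consecutive frames
    angles = np.arctan2(d[:, 1], d[:, 0])      # direction of motion, in [-pi, pi)
    # Map each angle onto one of n_symbols equal bins around the circle.
    symbols = ((angles + np.pi) / (2 * np.pi) * n_symbols).astype(int) % n_symbols
    return symbols
```

For instance, a rightward step followed by an upward step yields two different symbols, which is exactly the kind of discrete sequence a discrete-emission HMM expects.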
The advantage was that, from the state sequence output by the Viterbi algorithm, I could directly read off the maximum state number and hence my class. Also, during the learning phase, my Baum-Welch implementation learnt the classes automatically, depending on the number of states. You could refer to my blog post [which has a description of my approach to recognizing gestures using HMMs in that project] for additional information. I hope it helps.
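For reference, the Viterbi step can be sketched as follows. This is a minimal log-space NumPy implementation for a discrete-observation HMM; the function name and parameter layout are my own and not tied to any particular library:

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most likely state path for a discrete-observation HMM (log-space).

    obs:    sequence of observation symbol indices, length T
    log_pi: (S,)   log initial-state probabilities
    log_A:  (S, S) log transition probabilities, A[i, j] = P(state j | state i)
    log_B:  (S, K) log emission probabilities,  B[i, k] = P(symbol k | state i)
    Returns (path, log_prob) for the best state sequence.
    """
    T, S = len(obs), len(log_pi)
    delta = np.empty((T, S))            # best log-prob of any path ending in each state
    psi = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: come from i, land in j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()
```

To classify, you would typically run the observation sequence through each class's trained HMM and pick the class whose model yields the highest log-probability; with a topology like mine, you can instead read the class off the maximum state index in the returned path.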