In this example it seems to me that q is the hidden state sequence, i.e. a list of integers from 0 to 9. If you have two audio samples, say test and train, and you generate two sequences q_test and q_train, then |q_test - q_train|, with the norm taken componentwise, is not useful: it does not represent a meaningful notion of distance, because hidden state labels in an HMM may be arbitrary. Permuting the labels (together with the corresponding entries of the model parameters) gives an equivalent model but can give a different value of |q_test - q_train|.
A more natural way to think about distance may be the following: given q_train, you are interested in the probability that your test sample took that same path, which you can compute once you have the transition matrix and emission probabilities.
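To make the labeling issue concrete, here is a small sketch. The sequences and the permutation are made up for illustration and are not taken from your data; the point is only that consistently renaming the states changes the componentwise distance even though nothing about the model has changed.

```python
# Hypothetical decoded state sequences (illustrative values only).
q_train = [0, 0, 1, 2]
q_test = [0, 1, 1, 2]

def componentwise_distance(a, b):
    """Sum of absolute differences between two equal-length label sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# An arbitrary renaming of the same three states: 0 -> 2, 1 -> 0, 2 -> 1.
relabel = {0: 2, 1: 0, 2: 1}
q_train_relabeled = [relabel[s] for s in q_train]
q_test_relabeled = [relabel[s] for s in q_test]

dist_before = componentwise_distance(q_train, q_test)
dist_after = componentwise_distance(q_train_relabeled, q_test_relabeled)
print(dist_before, dist_after)  # the two values differ
```

The two sequences disagree at exactly one position in both cases, yet the numeric distance depends on which integers happen to name the states.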
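Here is a minimal sketch of that computation, under assumptions I am adding myself: discrete (categorical) emissions, a toy 2-state model, test and train sequences of the same length, and no zero probabilities along the path. The names `start`, `trans`, `emit`, `joint_log_prob` and `obs_log_prob` are hypothetical and do not come from any library. The first function gives log P(O_test, q_train); subtracting log P(O_test), obtained with the forward algorithm, gives the log of the posterior P(q_train | O_test).

```python
import math

def joint_log_prob(path, obs, start, trans, emit):
    """log P(obs, path): the observations were emitted along exactly this path."""
    if len(path) != len(obs):
        raise ValueError("path and observations must have the same length")
    logp = math.log(start[path[0]]) + math.log(emit[path[0]][obs[0]])
    for t in range(1, len(path)):
        logp += math.log(trans[path[t - 1]][path[t]])  # transition into state at t
        logp += math.log(emit[path[t]][obs[t]])        # emission from state at t
    return logp

def obs_log_prob(obs, start, trans, emit):
    """log P(obs) via the forward algorithm, rescaling at each step."""
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    logp = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        logp += math.log(scale)             # accumulate log of the normalizer
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return logp

# Toy model (assumed values, for illustration only).
start = [0.6, 0.4]                 # initial state probabilities
trans = [[0.7, 0.3], [0.4, 0.6]]   # trans[i][j] = P(next state j | state i)
emit = [[0.5, 0.5], [0.1, 0.9]]    # emit[i][k] = P(symbol k | state i)

q_train = [0, 0, 1]    # path decoded from the training sample
obs_test = [0, 1, 1]   # observed symbols of the test sample

log_joint = joint_log_prob(q_train, obs_test, start, trans, emit)
log_posterior = log_joint - obs_log_prob(obs_test, start, trans, emit)
print(log_joint, log_posterior)
```

For real audio features the emissions would usually be continuous (e.g. Gaussian), in which case `emit[i][k]` would be replaced by a density evaluated at the observation; the structure of the sum stays the same.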
Please let me know if I am misunderstanding your question.