First of all, is this procedure correct?
The vector quantization part is ok, but it's rarely used these days. You describe so-called discrete HMMs which nobody uses for speech. If you want continuous HMMs with GMM as probability distribution for emissions you don't need vector quantization.
Then, you focused on less important steps like MFCC extraction but skipped most important parts like HMM training with Baum-Welch and HMM decoding with Viterbi which are way more complex part of the training than initial estimation of the states with vector quantization.
Then, how do I deal with different sized words. I mean, If I have trained words of 500ms and 300ms, how many observable symbols do I introduce to compare with all the models?
If you decode speech you usually select the symbols which correspond to parts phonemes perceived by the human. Its traditional to take 3 symbols per phoneme. For example word "one" should have 9 states for 3 phonemes and word "seven" should have 15 states for 5 phonemes. This practice is proven to be effective. Of course you can vary this estimation slightly.