Simple speech recognition from scratch

https://stackoverflow.com/questions/23486554

16-07-2023

Question

The closest question I found to mine is this one (simple speech recognition methods), but since 3 years have passed and its answers are not sufficient, I am asking again.

I want to build, from scratch, a simple speech recognition system; I only need to recognize five words. As far as I know, the most commonly used audio features for this application are MFCCs, with HMMs for classification.

I'm able to extract the MFCCs from audio, but I still have some doubts about how to use the features to train an HMM and then perform classification.

As I understand it, I have to perform vector quantization. First I need a set of MFCC vectors, then I apply a clustering algorithm to get centroids. Then I use the centroids to perform vector quantization, which means comparing every MFCC vector against the centroids and labeling it with the most similar one.
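The labeling step described here can be sketched in plain Python. This is a minimal illustration, assuming the centroids have already been produced by a clustering algorithm such as k-means and that Euclidean distance is the similarity measure; the function names, the 2-D toy frames, and the codebook values are made up for the example.

```python
import math

def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (squared Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, centroid in enumerate(centroids):
        dist = sum((v - c) ** 2 for v, c in zip(vector, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def quantize(frames, centroids):
    """Map a sequence of feature vectors to a sequence of codebook indices."""
    return [nearest_centroid(frame, centroids) for frame in frames]

# Toy 2-D stand-ins for MFCC frames and a 3-entry codebook (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [4.0, 6.0], [0.6, 0.6]]
symbols = quantize(frames, codebook)
print(symbols)  # [0, 1, 2, 1]
```

Each recording then becomes a sequence of integers, one per frame, so a longer word simply yields a longer symbol sequence.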

The centroids are then the 'observable symbols' of the HMM. I feed the words to the training algorithm and create one HMM for each word. Then, given an audio query, I score it against all the models and pick the word whose model gives the highest probability.
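The "score against every model and pick the best" step can be sketched with the forward algorithm for discrete HMMs. This is only an illustration under stated assumptions: the two word models below are hand-written toy parameters rather than trained ones, and `log_likelihood` and `classify` are hypothetical helper names. Per-frame scaling is used to avoid numeric underflow on long sequences.

```python
import math

def log_likelihood(symbols, start, trans, emit):
    """Scaled forward algorithm: log P(symbols | model) for a discrete HMM."""
    n_states = len(start)
    alpha = [start[s] * emit[s][symbols[0]] for s in range(n_states)]
    total = 0.0
    for t in range(len(symbols)):
        if t > 0:
            # Propagate probability mass one step, then weight by the emission.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbols[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf  # the model cannot produce this sequence
        alpha = [a / scale for a in alpha]
        total += math.log(scale)
    return total

def classify(symbols, models):
    """Return the word whose HMM assigns the highest likelihood to `symbols`."""
    return max(models, key=lambda word: log_likelihood(symbols, *models[word]))

# Toy models: (start probabilities, transition matrix, emission matrix).
left_to_right = [[0.7, 0.3], [0.0, 1.0]]
models = {
    "yes": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.2, 0.8]]),
    "no": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.8, 0.2]]),
}
print(classify([0, 0, 1, 1], models))  # yes
print(classify([1, 1, 0], models))     # no
```

The two queries have different lengths, yet both are scored against the same models: every model sees the same symbol sequence, so the likelihoods remain comparable.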

First of all, is this procedure correct? Then, how do I deal with words of different lengths? I mean, if I have trained words of 500 ms and 300 ms, how many observable symbols do I introduce to compare with all the models?

Note: I don't want to use Sphinx, the Android API, the Microsoft API, or any other library.

Note 2: I would appreciate it if you shared more recent information on better techniques.

Solution

First of all, is this procedure correct?

The vector quantization part is fine, but it is rarely used these days. What you describe are so-called discrete HMMs, which nobody uses for speech. If you use continuous HMMs, with GMMs as the probability distributions for the emissions, you don't need vector quantization.
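In the continuous case, each state scores a raw feature vector directly with a mixture density instead of looking up a discrete symbol. A minimal sketch in plain Python, assuming diagonal covariance matrices (a common simplification, not something stated above); the function names and the toy mixture parameters are illustrative only.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance, evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, weights, means, variances):
    """Log density of a mixture of diagonal Gaussians, computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Toy emission model for one HMM state: two components in 2-D.
state_gmm = {
    "weights": [0.5, 0.5],
    "means": [[0.0, 0.0], [3.0, 3.0]],
    "variances": [[1.0, 1.0], [1.0, 1.0]],
}
near = log_gmm([0.1, -0.1], **state_gmm)
far = log_gmm([10.0, 10.0], **state_gmm)
print(near > far)  # True
```

In an HMM, this log density takes the place of the emission table entry for a state, so no codebook is required.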

Also, you focused on less important steps such as MFCC extraction but skipped the most important parts: HMM training with Baum-Welch and HMM decoding with Viterbi, which are far more complex than the initial estimation of the states with vector quantization.
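To give a sense of the decoding side, here is a minimal log-domain Viterbi sketch for a discrete HMM. It covers decoding only; Baum-Welch training is not shown. The model parameters are hand-picked toy values, and `viterbi` and `safe_log` are hypothetical helper names.

```python
import math

def safe_log(p):
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(p) if p > 0.0 else -math.inf

def viterbi(symbols, start, trans, emit):
    """Return the most likely state path and its log probability."""
    n_states = len(start)
    score = [safe_log(start[s]) + safe_log(emit[s][symbols[0]]) for s in range(n_states)]
    backpointers = []
    for symbol in symbols[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            best_prev = max(range(n_states), key=lambda p: score[p] + safe_log(trans[p][s]))
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + safe_log(trans[best_prev][s]) + safe_log(emit[s][symbol])
            )
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    best_log_prob = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_log_prob

start = [1.0, 0.0]
trans = [[0.7, 0.3], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.2, 0.8]]
path, log_prob = viterbi([0, 0, 1, 1], start, trans, emit)
print(path)  # [0, 0, 1, 1]
```

The returned path is the frame-to-state alignment, which is also what makes Viterbi useful for segmenting training data.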

Then, how do I deal with words of different lengths? I mean, if I have trained words of 500 ms and 300 ms, how many observable symbols do I introduce to compare with all the models?

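When decoding speech, you usually choose HMM states that correspond to parts of the phonemes a human perceives. It's traditional to use 3 states per phoneme. For example, the word "one" should have 9 states for its 3 phonemes, and the word "seven" should have 15 states for its 5 phonemes. This practice has proven effective, though you can vary the estimate slightly. The number of observations is simply the number of frames in the recording; the self-transitions of each state let the same model absorb utterances of different durations, so a 300 ms and a 500 ms recording can both be scored against every model.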
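The state counts above (9 for "one", 15 for "seven") can be turned into an initial left-to-right transition matrix as follows. This is a sketch under assumptions: the strict left-to-right topology without skips and the initial self-loop probability of 0.5 are illustrative choices that training would normally re-estimate, and `left_to_right_transitions` is a hypothetical helper name.

```python
STATES_PER_PHONEME = 3

def left_to_right_transitions(n_phonemes, stay_prob=0.5):
    """Build an initial transition matrix where each state either stays or advances."""
    n_states = n_phonemes * STATES_PER_PHONEME
    trans = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        if s + 1 < n_states:
            trans[s][s] = stay_prob            # self-loop absorbs extra frames
            trans[s][s + 1] = 1.0 - stay_prob  # advance to the next state
        else:
            trans[s][s] = 1.0                  # final state keeps the remaining frames
    return trans

# Phoneme counts taken from the text: "one" has 3, "seven" has 5.
one = left_to_right_transitions(3)
seven = left_to_right_transitions(5)
print(len(one), len(seven))  # 9 15
```

Because every state can repeat, a model with 9 states can align to any recording of 9 frames or more, which is how words of different durations fit the same fixed-size model.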

Licensed under: CC-BY-SA with attribution