Question

I want to use either Sphinx4 or the HTK toolkit to build a speech recognition application that aims to estimate a person's age from their voice. I understand, to a large extent, the statistical models involved in speech recognition. I am interested in mel-frequency cepstral coefficients and Gaussian mixture models because these two are better suited to my problem domain. Do I have to use neural networks and feed in training data from the vectors derived from the Sphinx classifiers? I am not quite sure where to start with Sphinx or the HTK toolkit. I am new to Sphinx and speech recognition, and my application is only a prototype.

Can anyone please offer some guidance in this regard? Kind regards.


Solution

Usually, the first place to start for something like this is to look for prior related work from the academic community. Minematsu et al. (2002) used Gaussian mixture models (GMMs) over mel-frequency cepstral coefficients to distinguish between old and young speakers.

Presumably, if you have access to training data with both old and young speakers, you should be able to do the same. Even if you'd like to try another classifier back-end such as neural networks, it would probably be good to start with GMMs since you know that they should work for your task and they'll give you something to compare with whatever other classifiers you'd like to try to use.
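To make the GMM idea concrete, here is a minimal sketch of a GMM-per-class classifier: fit one GMM to the feature frames of each age group, then label a new utterance with whichever model gives its frames the higher total log-likelihood. Everything in it is an assumption for illustration and not taken from the paper: the frames are synthetic 3-dimensional vectors standing in for real MFCCs (which you would extract with HTK or Sphinx and which typically have a dozen or more dimensions), the covariances are diagonal, and the function names, component count, and iteration count are made up.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def component_logs(x, gmm):
    """log(weight * density) for each (weight, mean, var) component."""
    return [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]


def logsumexp(vals):
    top = max(vals)
    return top + math.log(sum(math.exp(v - top) for v in vals))


def train_gmm(frames, k, iters=15, seed=0):
    """Fit a k-component diagonal GMM to the frames with plain EM."""
    rng = random.Random(seed)
    n, dim = len(frames), len(frames[0])
    gmean = [sum(f[d] for f in frames) / n for d in range(dim)]
    gvar = [sum((f[d] - gmean[d]) ** 2 for f in frames) / n + 1e-3
            for d in range(dim)]
    # Start from k random frames as means, all sharing the global variance.
    gmm = [(1.0 / k, list(m), list(gvar)) for m in rng.sample(frames, k)]
    for _ in range(iters):
        # E-step: posterior probability of each component for each frame.
        resp = []
        for x in frames:
            logs = component_logs(x, gmm)
            norm = logsumexp(logs)
            resp.append([math.exp(l - norm) for l in logs])
        # M-step: re-estimate weights, means, and (floored) variances.
        new_gmm = []
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-10
            mean = [sum(r[j] * x[d] for r, x in zip(resp, frames)) / nj
                    for d in range(dim)]
            var = [sum(r[j] * (x[d] - mean[d]) ** 2
                       for r, x in zip(resp, frames)) / nj + 1e-3
                   for d in range(dim)]
            new_gmm.append((nj / n, mean, var))
        gmm = new_gmm
    return gmm


def utterance_score(frames, gmm):
    """Total log-likelihood of all frames, treated as independent."""
    return sum(logsumexp(component_logs(x, gmm)) for x in frames)


def classify(frames, models):
    """Return the label whose GMM scores the utterance highest."""
    return max(models, key=lambda label: utterance_score(frames, models[label]))


def synth_frames(centers, n, rng):
    """Synthetic stand-in for MFCC frames: unit-variance blobs around centers."""
    return [[rng.gauss(c, 1.0) for c in rng.choice(centers)] for _ in range(n)]


rng = random.Random(42)
YOUNG = [[0.0, 0.0, 0.0], [3.0, 3.0, 0.0]]   # made-up cluster centers
OLD = [[-4.0, 2.0, 5.0], [-1.0, 6.0, 5.0]]   # made-up cluster centers
models = {
    "young": train_gmm(synth_frames(YOUNG, 200, rng), k=2),
    "old": train_gmm(synth_frames(OLD, 200, rng), k=2),
}
young_test = synth_frames(YOUNG, 50, rng)
old_test = synth_frames(OLD, 50, rng)
print(classify(young_test, models), classify(old_test, models))
```

With real data you would replace synth_frames with your MFCC extraction, use more mixture components, and hold out speakers for testing. The same per-class scores also give you a baseline to compare against if you later try a neural network.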

If you're just doing this for fun or as a research project, I would recommend using HTK, since I like how modular it is. However, if this is being done for something commercial, you should probably go with Sphinx, since it can be redistributed under a BSD-like license.

OTHER TIPS

I decided not to go with Sphinx 4 because it's based on hidden Markov models, which are primarily used for sequential analysis such as speech recognition and even multimodal inputs to an interface based on the input sequence. Instead I went with a program called Praat, which is for speech processing and synthesis. There is also a "plugin", if you like, called "Akustyk", which is used to analyse vowels and so on. Maybe that direction will be of value to you; I'm not sure.

You can then use MATLAB with the pattern recognition toolbox to implement your neural networks, GMMs, or whatever approach you wish to pursue.

Hope it was helpful.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow