Question

Recognition results are best if sampling rate and bit depth of the audio match the training data of the system.

So, does anyone know the exact sampling rate and/or bit depth (and/or stereo/mono) that is used in Microsoft Speech Platform (newest, if that's important)? And if so, do you remember where you got this information?

Please note that I am using the MS Speech Platform, not SAPI. Unless both use the same training data, the two are not the same, AFAIK. To be precise, I use this: http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.speechrecognitionengine.setinputtowavefile%28v=office.14%29.aspx

My first try is based upon the C++ code example given on the page.

Solution

The Microsoft.Speech SR engine doesn't need training (unlike the System.Speech SR engine), and it is relatively insensitive to sampling rate (it will work with anything > 8 kHz). 16-bit audio is preferred, but I believe it will also work with 8-bit audio.
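
If you want to check what a given WAV file actually contains before handing it to SetInputToWaveFile, something like the following could be used. This is only a sketch in Python using the standard `wave` module, not part of the Speech Platform itself. The helper names `describe_wav` and `looks_suitable` are made up for illustration, and the 8000 Hz floor and the 8/16-bit check are assumptions taken from the answer above, not documented limits.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() reports bytes per sample, so multiply by 8 for bits.
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def looks_suitable(path, min_rate_hz=8000):
    """Rough check based on the answer above (assumed thresholds, not official ones)."""
    rate, bits, _channels = describe_wav(path)
    return rate >= min_rate_hz and bits in (8, 16)
```

Calling `describe_wav("input.wav")` on your own file (the file name here is a placeholder) gives the three values the question asks about, at least for the input side.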

OTHER TIPS

I couldn't find any information regarding sample rate, but it seems the bit depth is actually 8-bit (maybe this has changed since Eric Brown's answer).

Quoted from this page listing supported audio formats:

"The Speech Platform downsamples audio that is of greater than 8-bit resolution."

You should be fine providing any bit depth that is a multiple of 8 bits (which is always the case anyway), since there will be no precision loss due to rounding (and, unlike with sample rate, reducing the resolution does not cause aliasing).
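
To make the bit-depth reduction concrete, here is a minimal Python sketch of one plausible way to go from 16-bit to 8-bit PCM: keep the high byte of each sample and shift it into the unsigned range that 8-bit WAV uses. How the Speech Platform does this internally is not stated on the quoted page, so treat this as an illustration only. The function name `pcm16_to_pcm8` is hypothetical.

```python
import struct


def pcm16_to_pcm8(data: bytes) -> bytes:
    """Convert little-endian signed 16-bit PCM to unsigned 8-bit PCM.

    Each sample keeps only its top 8 bits; the low byte is dropped,
    so no rounding step is involved.
    """
    count = len(data) // 2
    samples = struct.unpack("<%dh" % count, data[:count * 2])
    # Arithmetic shift maps -32768..32767 to -128..127, then offset to 0..255.
    return bytes((s >> 8) + 128 for s in samples)
```

Because the sample count is unchanged, the sample rate is untouched by this step; only the resolution of each sample changes.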

Licensed under: CC-BY-SA with attribution