Microsoft Speech Platform - sampling rate and bit depth

https://stackoverflow.com/questions/18142501

24-06-2022

Question

Recognition results are best if sampling rate and bit depth of the audio match the training data of the system.

So, does anyone know the exact sampling rate and/or bit depth (and/or stereo/mono) that is used in Microsoft Speech Platform (newest, if that's important)? And if so, do you remember where you got this information?

Please note that I am using the MS Speech Platform, not SAPI. Unless both use the same training data, they are not the same thing, AFAIK. To be precise, I use this: http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.speechrecognitionengine.setinputtowavefile%28v=office.14%29.aspx

My first try is based upon the C++ code example given on the page.

Solution

The Microsoft.Speech SR engine doesn't need training (unlike the System.Speech SR engine) and is relatively insensitive to sampling rate (it will work with any sampling rate above 8 kHz). 16-bit audio is preferred, but I believe that it will work with 8-bit audio.
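If you want to check what a given file actually contains before handing it to the recognizer, the WAV header can be read directly. The following is only a minimal sketch in Python using the standard-library `wave` module (uncompressed PCM files only); the helper names `describe_wav` and `check_against_answer` are made up for illustration, and the thresholds simply restate the numbers in this answer, not anything from official documentation.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        sample_rate_hz = wav.getframerate()
        bits_per_sample = wav.getsampwidth() * 8  # getsampwidth() is in bytes
        channels = wav.getnchannels()
    return sample_rate_hz, bits_per_sample, channels


def check_against_answer(sample_rate_hz, bits_per_sample):
    """Return a list of notes based on the thresholds stated in the answer above.

    Assumption: these limits come from the forum answer, not from a
    published specification of the Speech Platform.
    """
    notes = []
    if sample_rate_hz <= 8000:
        notes.append("sample rate is not above 8 kHz")
    if bits_per_sample < 16:
        notes.append("bit depth is below the preferred 16 bits")
    return notes


if __name__ == "__main__":
    import sys

    rate, bits, channels = describe_wav(sys.argv[1])
    print(f"{rate} Hz, {bits}-bit, {channels} channel(s)")
    for note in check_against_answer(rate, bits):
        print("note:", note)
```

Run it as `python check_wav.py input.wav` (the script name is arbitrary). An empty list of notes only means the file falls inside the range this answer vouches for.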

OTHER TIPS

I couldn't find any information regarding sample rate, but it seems the bit depth is actually 8-bit (maybe this has changed since Eric Brown's answer).

Quoted from this page listing supported audio formats:

The Speech Platform downsamples audio that is of greater than 8-bit resolution.

You should be fine providing any bit depth that is a multiple of 8 bits (which is always the case anyway), since there will be no precision loss due to rounding (and there is no aliasing for resolution, unlike for sample rate).
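To illustrate what reducing 16-bit audio to 8-bit resolution can look like, here is a small Python sketch. It assumes plain truncation of the low byte, which is one common way to do it; the quoted page does not say how the Speech Platform performs the reduction, so treat this as an illustration rather than a description of the engine. The function name `pcm16_to_pcm8` is hypothetical. It relies on the WAV convention that 16-bit PCM samples are signed and 8-bit PCM samples are unsigned.

```python
def pcm16_to_pcm8(sample):
    """Map a signed 16-bit sample (-32768..32767) to an unsigned 8-bit one (0..255).

    Assumption: simple truncation. The low byte is dropped, then the value
    is shifted into the unsigned range used by 8-bit WAV files.
    """
    if not -32768 <= sample <= 32767:
        raise ValueError("sample is outside the signed 16-bit range")
    return (sample >> 8) + 128


if __name__ == "__main__":
    for s in (-32768, -1, 0, 256, 511, 32767):
        print(s, "->", pcm16_to_pcm8(s))
```

Samples that differ only in their low byte, such as 256 and 511, end up as the same 8-bit value, so under this assumption the extra resolution is simply discarded.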

Licensed under: CC-BY-SA with attribution
