Question

I am creating an application that uses the Windows speech recognition engine (SAPI). It is like a pronunciation game: it gives you a score when you pronounce a word correctly. When I started experimenting with SAPI, recognition was poor unless I loaded an XML grammar into it, which gives the best recognition results.

The problem now is that the closest pronunciation to the input text gets recognized. For example:

Database -> dedebase -> correct.

Even if you mispronounce the word, it counts your answer as correct.

Without the XML grammar, when you say "database" it gives you "in the base", "the base", "data base", etc.

Please post your answers, suggestions, or clarifications; I will vote for the best answer.

Is this possible or not?

By the way, I use the Delphi compiler for this project.


Solution

For what you want, it is probably best not to use a grammar. But it does require that the users do the "minimal" basic training of the speech recognition engine. It's not very long and relatively pleasant, and it really makes a difference in recognition accuracy (believe me, I have a strong French accent in my English).
It can even be included as a preliminary practice for the game itself.
You may also find it interesting to watch the CodeRage 4 session on "Speech Enabling Delphi Applications" (zip).
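
If you want to offer that basic training from inside the game rather than sending users to the Speech control panel, ISpRecognizer::DisplayUI can launch the standard user-training wizard. A minimal C++ sketch follows (error handling trimmed; a Delphi version would call the same interface through the imported SAPI type library):

```cpp
// Launch the built-in SAPI user-training wizard, e.g. as a
// "calibration" step before the pronunciation game starts.
// Assumes COM has already been initialized.
#include <sapi.h>
#include <atlbase.h>

HRESULT RunUserTraining(HWND hwndParent)
{
    CComPtr<ISpRecognizer> recognizer;
    HRESULT hr = recognizer.CoCreateInstance(CLSID_SpSharedRecognizer);
    if (FAILED(hr))
        return hr;

    BOOL supported = FALSE;
    hr = recognizer->IsUISupported(SPDUI_UserTraining, NULL, 0, &supported);
    if (FAILED(hr) || !supported)
        return E_NOTIMPL;

    // Shows the same wizard as the Speech applet in the control panel.
    return recognizer->DisplayUI(hwndParent, NULL, SPDUI_UserTraining, NULL, 0);
}
```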

OTHER TIPS

I'd do three things:

  1. Convert the original text to phonemes by using ISpEnginePronunciation::GetPronunciations.
  2. Use a dictation grammar and the pronunciation language model to force SAPI to give you back a set of phonemes - do this by calling ISpRecoGrammar::LoadDictation(L"Pronunciation", SPLO_STATIC).
  3. Compare the recognized phonemes to the target phonemes.

Note that ISpEnginePronunciation isn't available on SAPI 5.1, so this is limited to Vista and Windows 7.
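
Very roughly, steps 1 and 2 could look like the C++ sketch below. It assumes you already hold an ISpRecoGrammar for your recognition context and that you have obtained an ISpEnginePronciation pointer from the engine (check the SAPI 5.3 documentation for how to acquire it); verify the exact signatures against your sapi.h, and note that a Delphi version would make the same calls through the imported SAPI type library:

```cpp
#include <sapi.h>
#include <atlbase.h>
#include <sphelper.h>
#include <string>

// Step 1: ask the engine for the reference phonemes of the target word.
// pEnginePron is assumed to have been obtained from the recognition engine.
HRESULT GetTargetPhones(ISpEnginePronunciation *pEnginePron,
                        LPCWSTR pszWord, LANGID langId,
                        std::wstring &phones)
{
    SPWORDPRONUNCIATIONLIST list = { 0 };
    HRESULT hr = pEnginePron->GetPronunciations(pszWord, NULL, NULL, langId, &list);
    if (FAILED(hr))
        return hr;

    // Convert the engine's phone IDs into a readable phone string.
    CComPtr<ISpPhoneConverter> converter;
    hr = SpCreatePhoneConverter(langId, NULL, NULL, &converter);
    if (SUCCEEDED(hr) && list.pFirstWordPronunciation != NULL)
    {
        WCHAR szPhones[512] = { 0 };
        hr = converter->IdToPhone(list.pFirstWordPronunciation->szPronunciation, szPhones);
        if (SUCCEEDED(hr))
            phones = szPhones;
    }
    ::CoTaskMemFree(list.pvBuffer);
    return hr;
}

// Step 2: switch the grammar to the pronunciation dictation topic so that
// recognitions come back as phonemes rather than words.
HRESULT UsePronunciationTopic(ISpRecoGrammar *pGrammar)
{
    HRESULT hr = pGrammar->LoadDictation(L"Pronunciation", SPLO_STATIC);
    if (SUCCEEDED(hr))
        hr = pGrammar->SetDictationState(SPRS_ACTIVE);
    return hr;
}

// Step 3: after a recognition, each SPPHRASEELEMENT in the SPPHRASE returned
// by ISpRecoResult::GetPhrase carries the recognized phones in its
// pszPronunciation member; run those through the same phone converter and
// compare against the target string.
```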

If the point of the game is to encourage the user to speak using pronunciation that is closest to "standard pronunciation" for a given language (e.g. EN-US), then having the user train the recognizer to adapt to the user's particular (unmodified) speech patterns may be counterproductive. You would in part be training the recognizer to be more forgiving of the user's pronunciation lapses.

Whether you end up using grammar-based recognition or dictation-based recognition (Eric Brown's post looks very promising), you will probably also want to look into "confidence" scores. These scores are available after a recognition has been performed, and they give a numeric value to how confident the recognizer is that what the user actually said matches what the recognizer thinks the user said. Depending on the recognizer configuration and use case, confidence scores may or may not be meaningful.
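
For example, with the native API the engine confidence can be read straight off the SPPHRASE; a minimal C++ sketch of the ISpRecoResult -> GetPhrase -> SPPHRASE -> Rule -> SREngineConfidence chain mentioned below (error handling trimmed):

```cpp
#include <sapi.h>

// Read the engine-level confidence for a recognition result.
HRESULT GetEngineConfidence(ISpRecoResult *pResult, float *pfConfidence)
{
    SPPHRASE *pPhrase = NULL;
    HRESULT hr = pResult->GetPhrase(&pPhrase);
    if (SUCCEEDED(hr) && pPhrase != NULL)
    {
        // SREngineConfidence is an engine-defined float; pPhrase->Rule.Confidence
        // holds the coarser SP_LOW/NORMAL/HIGH_CONFIDENCE value.
        *pfConfidence = pPhrase->Rule.SREngineConfidence;
        ::CoTaskMemFree(pPhrase);
    }
    return hr;
}
```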

If you are basing your accuracy score on the textual representation of the phones/phonemes/pronunciation, a quick and easy way to get an accuracy score would be to use Levenshtein distance, an algorithm for which many implementations are freely available on the net. A better scoring algorithm might be a resynchronizing diff, with the atomic unit of comparison being single phones.
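
For illustration, a character-level Levenshtein distance over two phone strings, wrapped into a simple 0..100 score, could look like this in C++ (the function names and scoring formula are just examples, not anything prescribed by SAPI; comparing whole phones as tokens would be the refinement mentioned above):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Classic two-row Levenshtein distance between two strings.
int LevenshteinDistance(const std::wstring &a, const std::wstring &b)
{
    std::vector<size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= a.size(); ++i)
    {
        cur[0] = i;
        for (size_t j = 1; j <= b.size(); ++j)
        {
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({ prev[j] + 1,          // deletion
                                cur[j - 1] + 1,       // insertion
                                prev[j - 1] + cost }); // substitution
        }
        std::swap(prev, cur);
    }
    return static_cast<int>(prev[b.size()]);
}

// Example scoring: 100 means identical pronunciations, 0 means nothing matched.
int PronunciationScore(const std::wstring &target, const std::wstring &spoken)
{
    size_t longest = std::max(target.size(), spoken.size());
    if (longest == 0) return 100;
    int dist = LevenshteinDistance(target, spoken);
    return static_cast<int>(100 - (100 * dist) / longest);
}
```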

Here are some keywords for MSDN doc hunting:
ISpRecoResult -> GetPhrase -> SPPHRASE -> Rule -> SPPHRASERULE -> SREngineConfidence.

http://msdn.microsoft.com/en-us/library/ee413319%28v=vs.85%29.aspx
http://msdn.microsoft.com/en-us/library/ms720460%28v=VS.85%29.aspx

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow