Using Sphinx4 and es_MX_broadcast_cont_2500

https://stackoverflow.com/questions/19107457

29-06-2022

Question

I am currently developing an audio transcriber for short Mexican Spanish (es_MX) interviews (about 2 minutes long). I have searched the web but can't find an answer to this one; maybe it's too easy. When I run the .jar, I get the warning below for (I presume) every accented word in etc/h4.dict from the es_MX_broadcast... VoxForge package, and I get no transcription and no other errors at all.

...

WARNING dictionary The dictionary is missing a phonetic transcription for the word 'kyrgyzst�'

WARNING dictionary The dictionary is missing a phonetic transcription for the word 'explotaci�'

WARNING dictionary The dictionary is missing a phonetic transcription for the word 'inclu�'

...

My guess is that the text encoding is misconfigured, or maybe I need to create the language model myself. I do want to train my own model eventually, but first I need to get this working. Here is the linguist/dictionary/language_model/acoustic_model part of my config.xml file:

<component name="lexTreeLinguist" 
            type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
    <property name="logMath" value="logMath"/>
    <property name="acousticModel" value="wsj"/>
    <property name="languageModel" value="trigramModel"/>
    <property name="dictionary" value="dictionary"/>
    <property name="addFillerWords" value="false"/>
    <property name="fillerInsertionProbability" value="1E-10"/>
    <property name="generateUnitStates" value="false"/>
    <property name="wantUnigramSmear" value="true"/>
    <property name="unigramSmearWeight" value="1"/>
    <property name="wordInsertionProbability" 
            value="${wordInsertionProbability}"/>
    <property name="silenceInsertionProbability" 
            value="${silenceInsertionProbability}"/>
    <property name="languageWeight" value="${languageWeight}"/>
    <property name="unitManager" value="unitManager"/>
</component>    

<component name="dictionary" 
    type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
    <property name="dictionaryPath"
              value="/home/csampez/Desktop/JavaDev/Sphinx/sphinx4/models/acoustic/es_MX_broadcast_cont_2500/etc/h4.dict"/>
    <property name="fillerPath" 
      value="/home/csampez/Desktop/JavaDev/Sphinx/sphinx4/models/acoustic/es_MX_broadcast_cont_2500/etc/filler.dict"/>
    <property name="addSilEndingPronunciation" value="false"/>
    <property name="wordReplacement" value="&lt;sil&gt;"/>
    <property name="unitManager" value="unitManager"/>
</component>

<component name="trigramModel" 
      type="edu.cmu.sphinx.linguist.language.ngram.large.LargeTrigramModel">
    <property name="unigramWeight" value=".7"/>
    <property name="maxDepth" value="3"/>
    <property name="logMath" value="logMath"/>
    <property name="dictionary" value="dictionary"/>
    <property name="location"
     value="/home/csampez/Desktop/JavaDev/Sphinx/sphinx4/models/acoustic/es_MX_broadcast_cont_2500/etc/H4.arpa.Z.DMP"/>
</component>

<component name="wsj"
           type="edu.cmu.sphinx.linguist.acoustic.tiedstate.TiedStateAcousticModel">
    <property name="loader" value="wsjLoader"/>
    <property name="unitManager" value="unitManager"/>
</component>

<component name="wsjLoader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="unitManager"/>
    <property name="location" value="/home/csampez/Desktop/JavaDev/Sphinx/sphinx4/models/acoustic/es_MX_broadcast_cont_2500/model_parameters/hub4_spanish_itesm.cd_cont_2500"/>
</component>

------- THIS IS NEW INFORMATION (10/3/2013)----------

Thanks, but that isn't the problem. The file was already UTF-8, I have already set JAVA_TOOL_OPTIONS to UTF-8, and I have also run the .jar with -Dfile.encoding, but nothing changed: I get the same list. I tried to find out whether there is another dictionary somewhere in the files, but I'm clueless. What is really odd is that h4.dict is in uppercase while the warnings are in lowercase, and some accented words don't appear in the warning list. I tried saving another .dict file with fewer words, but it didn't work; in fact, more words appeared in the warnings.

I don't know whether it matters that I'm not using a .jar for the acoustic model like the other demos do, or whether this is related to the fact that I get no transcription and no other errors at all.

I really hope someone can help me figure this out; in the meantime I'll keep trying.

Many thanks in advance.

Solution

You need to convert the file to UTF-8.

You need to use the Java option -Dfile.encoding=utf-8 so that the Java VM treats all input files as UTF-8.

Most importantly, es_MX_broadcast_cont requires a specific feature extractor. You need to replace DeltasFeatureExtractor with S3FeatureExtractor in the config file; otherwise accuracy will be zero.
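
A minimal sketch of the feature extractor change, under two assumptions that the question's config excerpt does not confirm: that the front end declares its extractor under the component name featureExtraction (the name commonly seen in Sphinx4 demo configs), and that both classes live in the edu.cmu.sphinx.frontend.feature package. Check your own config.xml for the actual component name.

```xml
<!-- Before (assumed): the default delta-based extractor
<component name="featureExtraction"
           type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>
-->

<!-- After: the Sphinx3-style extractor that this acoustic model is said to require.
     The component name "featureExtraction" is an assumption; keep whatever name
     your front-end pipeline already references. -->
<component name="featureExtraction"
           type="edu.cmu.sphinx.frontend.feature.S3FeatureExtractor"/>
```

Only the type attribute changes. Leaving the component name as it was means the front-end pipeline list that refers to it should not need editing.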

Licensed under: CC-BY-SA with attribution
