Question

I would like to programmatically check whether a string can be pronounced or needs to be spelled out.

For example, internationalization can be read out, but i18n cannot, nor can hhdirgxzf.

I can think of some simple heuristics such as checking whether the string contains non-alpha characters, but I hope there is a more robust and scientific way to do it. Are there algorithmic approaches that can score a string based on how easy it is to pronounce?

Related: Is there a way to rank the difficulty of pronunciation of a word?, however I don't have a list and I can't precompute.


Update based on comments.

  • As I'm an English speaker I'm interested in English, but I could imagine an algorithm based on how sound and speech work rather than on the characteristics of a particular language.
  • By pronounced I mean the string can be read out naturally. It's possible to pronounce hhdirgxzf, but it would not sound like a single natural-language word; it would need to be broken up.
  • A specific use case I have in mind is where I am sent strings and want to use a basic text-to-speech system to read them out loud. I want to determine which tokens in the string to let the TTS system try to pronounce and which to make it spell out, erring on the side of spelling out if not confident.

Solution

You might have some success by first splitting the word into syllables. This question on SO might help. Of course, this will only work for languages which, like English, are written in an alphabet whose letters include vowels.
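As a rough illustration of the syllable idea, here is a minimal Python sketch. It does not perform real syllabification: it only splits a token into alternating vowel and consonant groups and rejects tokens with no vowel or with a very long consonant run. The function name `looks_pronounceable`, the choice to count `y` as a vowel, and the threshold of 4 are all assumptions made for this example, not an established method.

```python
import re

# Assumption: treat "y" as a vowel, which is only sometimes true in English.
VOWELS = "aeiouy"


def looks_pronounceable(token: str, max_consonant_run: int = 4) -> bool:
    """Crude check: ASCII letters only, at least one vowel group,
    and no consonant group longer than max_consonant_run."""
    word = token.lower()
    if not (word.isascii() and word.isalpha()):
        return False  # digits, punctuation, or empty: spell it out

    # Split into alternating vowel groups and consonant groups.
    groups = re.findall(r"[aeiouy]+|[^aeiouy]+", word)
    has_vowel = any(g[0] in VOWELS for g in groups)
    longest_consonant_run = max(
        (len(g) for g in groups if g[0] not in VOWELS), default=0
    )
    return has_vowel and longest_consonant_run <= max_consonant_run


for token in ["internationalization", "i18n", "hhdirgxzf"]:
    action = "pronounce" if looks_pronounceable(token) else "spell out"
    print(f"{token}: {action}")
```

With these settings the sketch lets internationalization through and sends i18n and hhdirgxzf to be spelled out. It also rejects some real words with long consonant clusters, such as strengths, which matches the stated preference for spelling out when unsure.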

Other tips

Maybe count the alphabetic characters and divide by the length of the string, then score based on that density. You could also decrease the score for each digit.
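A minimal sketch of that density score in Python might look like the following. The function name `alpha_density_score` and the per-digit penalty of 0.1 are hypothetical values chosen for illustration; any real threshold would need tuning on your own data.

```python
def alpha_density_score(token: str, digit_penalty: float = 0.1) -> float:
    """Fraction of alphabetic characters, minus a penalty per digit.

    The result is clamped at 0.0; 1.0 means the token is all letters.
    """
    if not token:
        return 0.0
    letters = sum(ch.isalpha() for ch in token)
    digits = sum(ch.isdigit() for ch in token)
    score = letters / len(token) - digit_penalty * digits
    return max(0.0, score)


for token in ["internationalization", "i18n", "hhdirgxzf"]:
    print(token, round(alpha_density_score(token), 2))
```

Note that this score cannot separate internationalization from hhdirgxzf, since both consist only of letters, so on its own it is only useful as a first filter.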

What is the source of these strings? If you are generating them yourself, then you could try to generate likely pronounceable strings. Ideas that might work include:

  • start with a word and replace vowels with other vowels and consonants with similar consonants.

  • generate a random Soundex and work backwards to a word that generates that Soundex.

  • concatenate three or four pronounceable syllables.

  • alternate consonants and vowels.

  • draw words from Lorem Ipsum text.
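If you do control the generation, two of the ideas above (alternating consonants and vowels, and concatenating simple syllables) can be combined in a few lines. This is only a sketch: the function name `random_pronounceable`, the letter sets, and the consonant-vowel syllable shape are assumptions made for the example.

```python
import random
from typing import Optional

# Assumption: a reduced consonant set that avoids letters such as q and x.
CONSONANTS = "bcdfghjklmnprstvw"
VOWELS = "aeiou"


def random_pronounceable(
    syllables: int = 3, rng: Optional[random.Random] = None
) -> str:
    """Concatenate consonant+vowel syllables, so letters strictly alternate."""
    rng = rng or random.Random()
    return "".join(
        rng.choice(CONSONANTS) + rng.choice(VOWELS) for _ in range(syllables)
    )


# Passing a seeded generator makes the output reproducible.
print(random_pronounceable(3, random.Random(42)))
```

Every string produced this way alternates consonants and vowels, so a basic TTS system should be able to read it as a single word, although the result will not usually be a real word.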

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow