Question

What's so difficult about speech recognition that algorithm designers are having a hard time tackling it?

Is it really that complex?

I'm having a hard time grasping why this topic is so problematic. Can anyone give me an example as to why this is the case?


Solution

If people find it hard to understand other people with a strong accent, why do you think computers will be any better at it?

OTHER TIPS

Auditory processing is a very complex task. Human evolution has produced a system so good that we don't realize how good it is. If three people are talking to you at the same time, you will be able to focus on one signal and discard the others, even if they are louder. Noise is discarded very well too. In fact, if you hear a human voice played backwards, the first stages of the auditory system will send the signal to a different processing area than they would for a real speech signal, because the system regards it as "non-voice". This is an example of the outstanding abilities humans have.

Speech recognition advanced quickly starting in the 1970s because researchers were studying the production of voice. This is a simpler system: vocal cords excited or not, resonance of the vocal tract... it is a mechanical system that is easy to understand. The main product of this approach is cepstral analysis, which allowed automatic speech recognition (ASR) to achieve acceptable results. But it is a sub-optimal approach. Noise separation is quite poor: even though it works more or less in clean environments, it is not going to work with loud music in the background the way humans can.
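
As a rough sketch of what the cepstral approach looks like in practice: mel-frequency cepstral coefficients (MFCCs) are the classic features that come out of this "model the vocal tract" view of speech. This assumes the librosa library and a placeholder file name, purely for illustration.

    # Compute MFCC features from a recording (hypothetical file name).
    import librosa

    audio, sample_rate = librosa.load("utterance.wav", sr=16000)
    mfccs = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13)

    # One 13-dimensional vector per short frame of audio; a conventional ASR
    # system feeds these frames, not the raw waveform, to its acoustic model.
    print(mfccs.shape)  # (13, number_of_frames)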

The optimal approach depends on understanding the auditory system: its first stages in the cochlea and the inferior colliculus, but also the brain itself. And we don't know very much about this. It is proving to be a difficult paradigm shift.

In a paper, Professor Hynek Hermansky compared the current state of the research to the time when humans wanted to fly. We didn't know what the secret was (the feathers? the flapping of the wings?) until we discovered Bernoulli's principle.

I remember reading that Microsoft had a team working on speech recognition, and they called themselves the "Wreck a Nice Beach" team (a name given to them by their own software).

To actually turn speech into words, it's not as simple as mapping discrete sounds; there has to be an understanding of the context as well. The software would need to have a lifetime of human experience encoded in it.

This kind of problem is more general than speech recognition alone. It also exists in vision processing, natural language processing, artificial intelligence, and so on.

Speech recognition is affected by the semantic gap problem:

The semantic gap characterizes the difference between two descriptions of an object by different linguistic representations, for instance languages or symbols. In computer science, the concept is relevant whenever ordinary human activities, observations, and tasks are transferred into a computational representation.

Between an audio waveform and a textual word, the gap is big.

Between the word and its meaning, it is even bigger...

beecos iyfe peepl find it hard to arnerstand uvver peepl wif e strang acsent wie doo yoo fink compootrs wyll bee ani bettre ayt it?

I bet that took you half a second to work out what the hell I was typing, and all I was doing was repeating Simon's answer in a different 'accent'. The processing power just isn't there yet, but it's getting there.

The variety in language would be the predominant factor making it difficult, and dialects and accents make it more complicated still. There is also context: "The book was read." "The book was red." How do you determine the difference? The extra effort needed for this would make it easier to just type the thing in the first place.
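
To make the "read"/"red" point concrete, here is a minimal sketch of how a recognizer can use the surrounding words to pick between two transcriptions that sound identical. The bigram counts are made up for illustration, not taken from any real corpus or engine.

    # Toy bigram counts, as if estimated from a large text corpus (hypothetical values).
    BIGRAM_COUNTS = {
        ("the", "book"): 120,
        ("book", "was"): 80,
        ("was", "read"): 50,   # "the book was read" is common in written text
        ("was", "red"): 5,     # "the book was red" is rarer
    }

    def score(sentence):
        """Sum of bigram counts; higher means the word sequence is more plausible."""
        words = sentence.lower().split()
        return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(words, words[1:]))

    # Both candidates match the audio equally well; context breaks the tie.
    candidates = ["the book was read", "the book was red"]
    print(max(candidates, key=score))  # -> "the book was read"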

Now, there would probably be more effort devoted to this if it were more necessary, but advances in other forms of data input have come along so quickly that it is not deemed that necessary.

Of course, there are areas where it would be great, even extremely useful or helpful: situations where you have your hands full or can't look at a screen for input, helping the disabled, and so on. But most of these are niche markets which have their own solutions. Maybe some of those are working towards this, but most environments where computers are used are not good candidates for speech recognition. I prefer my working environment to be quiet, and endless chatter to computers would make crosstalk a realistic problem.

On top of this, unless you are dictating prose to the computer, any other type of input is easier and quicker using a keyboard, mouse, or touch. I did once try coding using voice input. The whole thing was painful from beginning to end.

Because Lernout & Hauspie went bust :)

(sorry, as a Belgian I couldn't resist)

The basic problem is that human language is ambiguous. Therefore, in order to understand speech, the computer (or human) needs to understand the context of what is being spoken. That context is actually the physical world the speaker and listener inhabit. And no AI program has yet demonstrated a deep understanding of the physical world.

Speech synthesis is very complex by itself - many parameters are combined to form the resulting speech. Breaking it apart is hard even for people - sometimes you mishear one word for another.

Most of the time we humans understand based on context, so that a particular sentence is in harmony with the whole conversation. Unfortunately, computers have a big handicap in this sense: they just try to capture the words, not what lies between them.

We would understand a foreigner whose English accent is very poor, maybe by guessing what he is trying to say instead of what he is actually saying.

To recognize speech well, you need to know what people mean - and computers aren't there yet at all.

You said it yourself: algorithm designers are working on it... but language and speech are not algorithmic constructs. They are the peak of the development of a highly complex human system involving concepts, meta-concepts, syntax, exceptions, grammar, tonality, emotions, and neuronal as well as hormonal activity.

Language needs a highly heuristic approach, and that's why progress is slow and the prospects are maybe not too optimistic.

I once asked my instructor a similar question; I asked him something like what the challenge is in making a speech-to-text converter. Among the answers he gave, he asked me to pronounce 'p' and 'b'. Then he said that they differ only for a very small time at the beginning, and after that they sound similar. My point is that it is hard even to recognize which sound was made; recognizing a voice would be even harder. Also, note that once you record people's voices, it is just numbers that you store. Imagine trying to find metrics like accent, frequency, and other parameters useful for identifying a voice from nothing but input such as matrices of numbers. Computers are good at numerical processing, but voice is not really 'numbers'. You need to encode the voice as numbers and then do all the computation on them.
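
As a minimal illustration of the "voice is just numbers" point, here is a sketch that loads a recording and shows what the computer actually has to work with. It assumes NumPy, SciPy, and a hypothetical mono WAV file.

    import numpy as np
    from scipy.io import wavfile

    # A recorded voice is nothing but an array of integers (hypothetical file name).
    sample_rate, samples = wavfile.read("recording.wav")
    print(sample_rate)   # e.g. 16000 samples per second
    print(samples[:10])  # just numbers like [12, -7, 35, ...]

    # Anything like pitch or accent has to be computed from those numbers, for
    # example a crude pitch estimate from the dominant frequency of one frame:
    window = samples[:2048].astype(np.float64) * np.hanning(2048)
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(2048, d=1.0 / sample_rate)
    print("dominant frequency:", freqs[np.argmax(spectrum[1:]) + 1], "Hz")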

I would expect some advances from Google in the future because of their voice data collection through 1-800-GOOG411

It's not my field, but I do believe it is advancing, just slowly.

And I believe Simon's answer is somewhat correct: part of the problem is that no two people speak alike in terms of the patterns that a computer is programmed to recognize. Thus, it is difficult to analyze speech.

Computers are not even very good at natural language processing to start with. They are great at matching but when it comes to inferring, it gets hairy.

Then try to figure out the same word from hundreds of different accents and inflections, and it suddenly doesn't seem so simple.

Well I have got Google Voice Search on my G1 and it works amazingly well. The answer is, the field is advancing, but you just haven't noticed!

If speech recognition were possible with substantially fewer MIPS than the human brain uses, we really could talk to the animals.

Evolution wouldn't spend all those calories on grey matter if it weren't required to do the job.

Spoken language is context-sensitive and ambiguous, and computers don't deal well with ambiguous commands.

I don't agree with the assumption in the question - I have recently been introduced to Microsoft's speech recognition and am impressed. It can learn my voice after a few minutes and usually identifies common words correctly. It also allows new words to be added. It is certainly usable for my purposes (understanding chemistry).

Differentiate between recognising the (word) tokens and understanding the meaning of them.

I don't yet know about other languages or operating systems.

The problem is that there are two types of speech recognition engines. Speaker-trained ones such as Dragon are good for dictation. They can recognize almost any spoken text with fairly good accuracy, but require (a) training by the user, and (b) a good microphone.

Speaker-independent speech rec engines are most often used in telephony. They require no "training" by the user, but must know ahead of time exactly which words are expected. The application development effort to create these grammars (and deal with errors) is huge. Telephony is limited to a 4 kHz bandwidth due to historical limits in our public phone network. This limited audio quality greatly hampers the speech rec engines' ability to "hear" what people are saying. Digits such as "six" or "seven" contain an "s" sound that is particularly hard for the engines to distinguish. This means that recognizing strings of digits, one of the most basic recognition tasks, is problematic. Add in regional accents, where "nine" is pronounced "nan" in some places, and accuracy really suffers.
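
To illustrate how much the 4 kHz telephone band costs, here is a sketch that downsamples a wideband recording to telephone rate. The file names are placeholders; it assumes SciPy and a 16 kHz mono recording.

    from scipy.io import wavfile
    from scipy.signal import resample_poly

    rate, samples = wavfile.read("wideband_speech.wav")  # hypothetical 16 kHz recording
    assert rate == 16000

    # Downsample 16 kHz -> 8 kHz; everything above the 4 kHz Nyquist limit is gone,
    # including much of the high-frequency energy of the "s" in "six" and "seven".
    narrowband = resample_poly(samples, up=1, down=2)
    wavfile.write("telephone_quality.wav", 8000, narrowband.astype(samples.dtype))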

The best hope is interfaces that combine graphics and speech rec. Think of an iPhone application that you can control with your voice.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow