Question

I am working on a speech emotion recognition system for live recordings, using the OpenSMILE library for feature extraction. I have collected a set of audio files containing different classes of speech, extract features from them, and train an SVM-based classifier for emotion recognition. However, this completely fails when tested on live speech, because the signal, and hence the feature distributions (MFCCs, LSP, pitch, intensity, F0), differ considerably between live speech and the files. OpenSMILE uses PortAudio to access the audio signal from the microphone.
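For reference, my training step is essentially the standard scikit-learn pattern; the sketch below is illustrative only (the feature export file name, label column, and SVM parameters are placeholders, not my exact setup):

```python
# Sketch: train an SVM emotion classifier on features exported by OpenSMILE.
# Assumes a CSV where each row is one utterance: feature columns plus a "label" column.
import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

df = pd.read_csv("opensmile_features.csv")      # hypothetical export path
X = df.drop(columns=["label"]).values
y = df["label"].values

# Scaling matters: SVMs are sensitive to feature ranges (MFCC vs. F0 vs. intensity).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validation on file-based data
clf.fit(X, y)
```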

I have tried playing a file (f_original) over the air and record it through the microphone then have OpenSMILE save it (f_distorted). I found that f_original and f_distorted do not sound very different to the human ear when played. However the audio signals when visualized in audacity differ quite a bit and the features extracted from f_original and f_distorted differ significantly. The file f_original is at 16000Hz and I upsample it to 44100Hz before feature extraction. The microphone records at 44100Hz.
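For completeness, the upsampling step is just a standard band-limited resample; a minimal sketch using librosa (file names are placeholders):

```python
# Sketch: upsample a 16 kHz file to 44.1 kHz before feature extraction.
import librosa
import soundfile as sf

y, sr = librosa.load("f_original.wav", sr=16000)             # load at native 16 kHz
y_44k = librosa.resample(y, orig_sr=16000, target_sr=44100)  # band-limited resampling
sf.write("f_original_44k.wav", y_44k, 44100)
```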

While I do expect some distortion when recording through the microphone, the amount of distortion that I see is extreme.

Has anyone else faced similar problems? Any pointers on how to fix this?

Thanks!


Solution

This will depend a great deal on the environmental factors of the recording: the room itself, the frequency response of the speaker/microphone combination, and their type and position within the room. Software may be able to help you clean this up, but getting a clean recording will be the single most important factor affecting how well your system can profile the speech.

Assuming your recording levels are set correctly, and your microphone and speakers have a relatively flat frequency response, you will still be transforming the frequency profile of the sound according to the environment.

This effect may not be immediately obvious on playback, but a number of elements of the sound will be affected detrimentally. This has been used by composers to great effect.

See Alvin Lucier's I am sitting in a room at http://www.ubu.com/sound/lucier.html for a beautiful example of this type of composition.

Many of the transient-smearing effects you hear in that recording will affect speech profiling dramatically, so the set-up of your recording will need to be considered in great detail. It's probably best to speak to a sound engineer for tips on the recording setup, as this seems to be the part you are struggling with; for example, you don't mention the acoustic properties of the room you are using, or the audio set-up.

You could also measure an impulse response of the room/mic/speaker set-up you intend to use, and then deconvolve the recorded speech with that impulse response, which should theoretically reduce the recording to a near-perfect representation of the original signal. This is tricky, but can produce some jaw-dropping results.
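A rough sketch of that idea in Python, using simple regularized frequency-domain division (a Wiener-style deconvolution). In practice you would measure the impulse response properly, e.g. with a sine sweep, so treat the file names and the regularization constant as placeholders:

```python
# Sketch: deconvolve a recorded signal with a measured room/mic/speaker impulse response.
import numpy as np
import soundfile as sf

recorded, sr = sf.read("f_distorted.wav")   # speech captured through the mic
ir, sr_ir = sf.read("room_impulse.wav")     # measured impulse response (same sample rate)
assert sr == sr_ir

# Mix down to mono if needed.
if recorded.ndim > 1:
    recorded = recorded.mean(axis=1)
if ir.ndim > 1:
    ir = ir.mean(axis=1)

# Zero-pad both to a common length and divide in the frequency domain.
n = len(recorded) + len(ir) - 1
R = np.fft.rfft(recorded, n)
H = np.fft.rfft(ir, n)

eps = 1e-3 * np.max(np.abs(H))              # regularization to avoid dividing by ~0
estimate = np.fft.irfft(R * np.conj(H) / (np.abs(H) ** 2 + eps ** 2), n)

sf.write("f_recovered.wav", estimate[:len(recorded)], sr)
```

The regularization term trades off how aggressively the room response is inverted against how much noise gets amplified at frequencies where the impulse response has little energy.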

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow