How to extract human voice from an audio clip, using machine learning?

https://stackoverflow.com/questions/22249263

10-06-2023

Question

How can we use machine learning to extract the human voice from an audio clip that may contain a lot of noise across the whole frequency range?

Solution

As in any ML application, the process is simple: collect samples, design features, train a classifier. For the samples, you can use your own noisy recordings, or you can find plenty of noise recordings in web sound collections such as freesound.org. For the features, you can use mean-normalized mel-frequency cepstral coefficients (MFCCs); an implementation is available in the CMUSphinx speech recognition toolkit. For the classifier, you can pick a GMM or an SVM. If you have enough data, it will work fairly well.
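Here is a minimal sketch of that collect/featurize/train loop in plain Python, just to show the shape of it. It rests on several assumptions: the frames are synthetic stand-ins for real MFCC vectors (the helper `synthetic_recording` is hypothetical), each class is modeled by a single diagonal Gaussian, which is only the one-component special case of a GMM, and mean normalization is done per recording. For real audio you would take the features from a toolkit and train a proper GMM or SVM.

```python
import math
import random

def mean_normalize(frames):
    """Subtract each coefficient's mean over the whole recording."""
    dims = range(len(frames[0]))
    means = [sum(f[d] for f in frames) / len(frames) for d in dims]
    return [[f[d] - means[d] for d in dims] for f in frames]

def fit_diag_gaussian(frames, var_floor=1e-3):
    """Estimate the per-coefficient mean and (floored) variance."""
    dims = range(len(frames[0]))
    n = len(frames)
    mean = [sum(f[d] for f in frames) / n for d in dims]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, var_floor)
           for d in dims]
    return mean, var

def log_likelihood(frame, model):
    """Log density of one frame under a diagonal Gaussian."""
    mean, var = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, mean, var))

def is_voice(frame, voice_model, noise_model):
    """Pick the class whose model explains the frame better."""
    return log_likelihood(frame, voice_model) > log_likelihood(frame, noise_model)

def synthetic_recording(rng, n_frames, offset, dims=3):
    """Hypothetical stand-in for labeled MFCC frames: (frames, labels).

    Voice and noise alternate in blocks of 20 frames; `offset` mimics a
    constant channel shift that mean normalization should remove.
    """
    frames, labels = [], []
    for i in range(n_frames):
        voiced = (i // 20) % 2 == 0
        centre = 2.0 if voiced else -1.0
        frames.append([offset + centre + rng.gauss(0, 0.5) for _ in range(dims)])
        labels.append(voiced)
    return frames, labels

rng = random.Random(0)

# Train: normalize the whole recording first, then split frames by label.
train_frames, train_labels = synthetic_recording(rng, 400, offset=5.0)
train_frames = mean_normalize(train_frames)
voice_model = fit_diag_gaussian(
    [f for f, v in zip(train_frames, train_labels) if v])
noise_model = fit_diag_gaussian(
    [f for f, v in zip(train_frames, train_labels) if not v])

# Test on a second recording with a different channel offset.
test_frames, test_labels = synthetic_recording(rng, 200, offset=-3.0)
test_frames = mean_normalize(test_frames)
predictions = [is_voice(f, voice_model, noise_model) for f in test_frames]
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(f"frame accuracy: {accuracy:.2f}")
```

Note that the normalization runs over the whole recording before the frames are split by label. If you normalized the voice and noise training sets separately, both would end up centered on zero and the class means would no longer tell them apart.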

To improve the accuracy, you can add the assumption that noise and voice are each continuous over time. You can then analyze the detection history with a hangover scheme (essentially an HMM) to detect voice chunks, instead of analyzing every frame individually.
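One possible way to write such a hangover scheme is the small two-state smoother below. This is a sketch, not a reference implementation: the function names and the `onset_frames` / `hangover_frames` thresholds are my own illustrative choices, and the onset debounce is an extra assumption on top of the plain hangover (which only delays the switch back to noise).

```python
def apply_hangover(decisions, onset_frames=3, hangover_frames=8):
    """Smooth per-frame voice decisions using the detection history.

    The state switches to voice only after `onset_frames` consecutive
    voice frames, and back to noise only after `hangover_frames`
    consecutive noise frames.
    """
    smoothed = []
    in_voice = False
    run = 0  # consecutive frames that contradict the current state
    for decision in decisions:
        decision = bool(decision)
        if decision != in_voice:
            run += 1
            needed = hangover_frames if in_voice else onset_frames
            if run >= needed:
                in_voice = not in_voice
                run = 0
        else:
            run = 0
        smoothed.append(in_voice)
    return smoothed

def voice_chunks(smoothed):
    """Return (start, end) frame index pairs for voice runs, end exclusive."""
    chunks = []
    start = None
    for i, voiced in enumerate(smoothed):
        if voiced and start is None:
            start = i
        elif not voiced and start is not None:
            chunks.append((start, i))
            start = None
    if start is not None:
        chunks.append((start, len(smoothed)))
    return chunks

# Raw per-frame classifier output with a spurious hit and a short dropout.
raw = [False, True, False, False, True, True, True, True,
       False, True, True, False, False, False, False, False]
smoothed = apply_hangover(raw, onset_frames=2, hangover_frames=3)
chunks = voice_chunks(smoothed)
print(chunks)
```

In this example the isolated voice frame at index 1 is ignored and the one-frame dropout at index 8 is bridged, so you get one voice chunk instead of three fragments. The price is latency: the chunk starts one frame late and ends a couple of frames late, which you can compensate for by padding the chunk boundaries.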

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow