Question

So, I have 9k one-second wav files of a person speaking, labeled by whether the speaker is wearing a face mask or not. I am supposed to come up with a machine learning model that classifies the clips by this criterion.

So far I have tried k-NN on MFCC features of the audio. This gets around 56% accuracy on test data.
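
Roughly, that pipeline looks something like the sketch below (simplified; the number of MFCCs, the mean/std summary, and `k` are illustrative choices, and `wav_paths`/`labels` stand in for my data loading):

```python
# Minimal MFCC + k-NN baseline sketch (illustrative parameters).
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def mfcc_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)                     # 1-second clip
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    # Summarize over time so every clip gives a fixed-length vector
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# wav_paths and labels (0 = no mask, 1 = mask) are assumed to come from the dataset
X = np.array([mfcc_features(p) for p in wav_paths])
y = np.array(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```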

I also tried converting the wav files to JPG spectrograms and applying a CNN. That one gets 60% accuracy on test data.

But I don't have much experience, so I am not sure which audio features would help most with this problem.

Could you also recommend a machine learning model for this particular problem?


Solution

If you have 4,500 examples of each category, you're doing better than random guessing. This sounds like a hard problem where the classes differ only subtly, so that's an accomplishment. (I assume you're doing some kind of out-of-sample testing.) Try classifying a small held-out set of, say, 20 examples of each class; I would be curious to hear the result. It is quite a different problem if you get 38/40 than if you get 20/40.

The way to apply a CNN, though, would be to convert to a spectrogram and then run the CNN on the spectrogram array. Converting to a literal picture is unnecessary and might harm performance. There are many ways to convert your signal to two dimensions of time-frequency space. You’ve probably tried a Fourier transform. Check out wavelets.
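
For instance, something along these lines builds log-mel spectrogram arrays directly and feeds them to a small Keras CNN. This is only a sketch: the sample rate, number of mel bands, and architecture are illustrative, and `wav_paths` stands in for your file list.

```python
# Feed spectrogram *arrays* (not JPGs) to a CNN.
import numpy as np
import librosa
import tensorflow as tf

def log_mel(path, sr=16000, n_mels=64):
    y, _ = librosa.load(path, sr=sr, duration=1.0)
    y = librosa.util.fix_length(y, size=sr)                  # pad/trim to exactly 1 s
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

# Shape: (n_clips, n_mels, n_frames, 1) -- a float array, no image-encoding step
X = np.stack([log_mel(p) for p in wav_paths])[..., np.newaxis]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=X.shape[1:]),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```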

Since you have time-series data, consider recurrent neural networks and long short-term memory (LSTM) models. These can be combined with a CNN; you will find examples of CNN/LSTM code on GitHub, I'd imagine.
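
A rough Keras sketch of the CNN + LSTM idea: 1-D convolutions over the sequence of MFCC frames feeding an LSTM. All layer sizes are illustrative, and the frame count is an assumption you would adjust to your features.

```python
# Illustrative CNN + LSTM over per-frame MFCC features.
import tensorflow as tf

n_frames, n_mfcc = 44, 13   # e.g. MFCC frames for a 1-second clip; adjust to your setup

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_frames, n_mfcc)),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```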

You have a fairly small data set for deep learning approaches, however. Consider simpler models like logistic regression: your in-sample performance may suffer, but out-of-sample performance may improve.
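
For example, a regularized logistic regression on fixed-length MFCC summary vectors might look like the following sketch (`X` and `y` are assumed to be built as in the MFCC example in the question; the regularization strength is illustrative):

```python
# Simple, well-regularized baseline: scaling + logistic regression, cross-validated.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy:", scores.mean())
```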

Finally, consider a proper scoring rule like Brier score. Frank Harrell has written much about this on the statistics Stack, Cross Validated. Shamelessly, I will mention that you may be interested in a question of mine on CV where I somewhat challenge the idea of proper scoring rules when there is a hard decision to be made: https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email. My example comes from NLP, but there could be a speech situation (say text dictation, where, at some point, you have to decide to print “dad” or “bad” or “bicycle”).
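
For instance, with scikit-learn you can score the predicted probabilities rather than the hard labels. This sketch reuses the train/test split and the logistic-regression pipeline names from the examples above, which is an assumption about how you set things up.

```python
# Evaluate with the Brier score (a proper scoring rule) on predicted probabilities.
from sklearn.metrics import brier_score_loss

clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]     # predicted P(mask)
print("Brier score (lower is better):", brier_score_loss(y_test, probs))
```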

OTHER TIPS

Some recommendations based on what I've done. Here is a useful tutorial that explains how to implement a CNN for wav files:

https://medium.com/gradientcrescent/urban-sound-classification-using-convolutional-neural-networks-with-keras-theory-and-486e92785df4

In my case, the CNN was overfitting and I wasn't able to fix that.

A simple NN model gave me the best accuracy, around 67%; here is the notebook used.
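
(Not the notebook itself, but purely as an illustration of the kind of small dense network I mean: something like the sketch below, where the input is assumed to be a fixed-length MFCC summary vector and all layer sizes are illustrative, not the actual model.)

```python
# Illustrative small dense network on fixed-length feature vectors.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X.shape[1],)),   # e.g. 26-dim MFCC mean/std vector
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30, batch_size=32)
```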

Also, as was pointed out previously, the training data set is fairly small for neural networks, so in addition I used audio data augmentation to reduce overfitting and increase the training set size.
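
Simple waveform-level augmentation might look like this sketch; the specific transforms and parameters are illustrative (not necessarily what I used), and `clips` stands in for the loaded one-second signals:

```python
# Waveform augmentation: additive noise, time shift, pitch shift.
import numpy as np
import librosa

def augment(y, sr):
    out = []
    out.append(y + 0.005 * np.random.randn(len(y)))               # additive Gaussian noise
    out.append(np.roll(y, int(0.1 * sr)))                         # circular time shift
    out.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=2))  # pitch up 2 semitones
    return out

# Each original clip yields several augmented copies that share its label
augmented = []
for y_clip, label in zip(clips, labels):
    for y_aug in augment(y_clip, sr=16000):
        augmented.append((y_aug, label))
```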

I would recommend trying a pretrained CNN to extract features, then training a simple classifier on top of that. OpenL3, for example, is very easy to use and performs well on a range of tasks. The classifier could be, for example, logistic regression or a random forest.
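
A rough sketch of that approach with the openl3 package; the content type, embedding size, and classifier choice are illustrative, and `wav_paths`/`labels` are assumed as before:

```python
# Pretrained OpenL3 embeddings + a simple linear classifier.
import numpy as np
import soundfile as sf
import openl3
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def openl3_embedding(path):
    audio, sr = sf.read(path)
    emb, _ = openl3.get_audio_embedding(audio, sr, content_type="env", embedding_size=512)
    return emb.mean(axis=0)          # average frame embeddings into one vector per clip

X = np.array([openl3_embedding(p) for p in wav_paths])
y = np.array(labels)

clf = LogisticRegression(max_iter=1000)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```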

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange