Question

I was looking into the possibility of classifying sound (for example, animal sounds) using spectrograms. The idea is to use a deep convolutional neural network to recognize segments in the spectrogram and output one (or many) class labels. This is not a new idea (see, for example, whale sound classification or music style recognition).

The problem I'm facing is that my sound files have different lengths and therefore produce spectrograms of different sizes. So far, every approach I have seen uses fixed-size sound samples, but I can't do that because my sound file might be 10 seconds or 2 minutes long.

A file might, for example, contain a bird sound at the beginning and a frog sound at the end (the output should then be "Bird, Frog"). My current solution would be to add a temporal component to the neural network (making it more of a recurrent neural network), but I would like to keep it simple for now. Any ideas, links, tutorials, ...?

Solution 2

RNNs were not producing good enough results and are also hard to train, so I went with CNNs.

Because a specific animal sound is only a few seconds long, we can divide the spectrogram into chunks; I used a length of 3 seconds. We then perform classification on each chunk and average the outputs to create a single prediction per audio file. This works really well and is also simple to implement, as the sketch below shows.
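A minimal sketch of the chunk-and-average scheme, assuming a Keras-style `model` already trained on fixed-size spectrogram chunks; `classify_clip`, `chunk_frames`, and `hop_frames` are hypothetical names, and max-pooling over chunks would be a common alternative to averaging:

```python
import numpy as np

def classify_clip(spectrogram, model, chunk_frames, hop_frames=None):
    """Split a (freq_bins, time_frames) spectrogram into fixed-length
    chunks along time, classify each chunk, and average the per-chunk
    outputs into one clip-level prediction."""
    hop_frames = hop_frames or chunk_frames  # non-overlapping by default
    chunk_probs = []
    for start in range(0, spectrogram.shape[1], hop_frames):
        chunk = spectrogram[:, start:start + chunk_frames]
        if chunk.shape[1] < chunk_frames:  # zero-pad a short final chunk
            chunk = np.pad(chunk, ((0, 0), (0, chunk_frames - chunk.shape[1])))
        # Assumed Keras-style interface: (batch, freq, time, channel) in,
        # (batch, n_classes) class probabilities out
        probs = model.predict(chunk[np.newaxis, :, :, np.newaxis], verbose=0)
        chunk_probs.append(probs[0])
    return np.mean(chunk_probs, axis=0)  # clip-level prediction
```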

A more in-depth explanation can be found here: http://ceur-ws.org/Vol-1609/16090547.pdf

OTHER TIPS

For automatic speech recognition (ASR), filter bank (fbank) features perform as well as CNNs applied to spectrograms (see Table 1 of the linked comparison). You could train a DBN-DNN system on fbank features to classify animal sounds; a sketch of the fbank front end follows.
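A minimal sketch of extracting log-mel fbank features, assuming librosa is available; the file name, 16 kHz sample rate, and the 40-band / 25 ms window / 10 ms hop framing are conventional ASR choices, not prescribed by the answer:

```python
import librosa

# Hypothetical input file; "animal.wav" is just a placeholder
y, sr = librosa.load("animal.wav", sr=16000)

# 40-band log-mel filter bank features: 25 ms windows (400 samples)
# with a 10 ms hop (160 samples) at 16 kHz
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40
)
fbank = librosa.power_to_db(mel)  # shape: (40, n_frames)
```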

In practice, longer speech utterances are divided into shorter ones, since Viterbi decoding does not work well on long utterances. You could do the same here.

You can divide the longer utterances into smaller utterances of fixed length. The division itself is easy; the harder part is extending the shorter leftover pieces to reach the fixed length. A simple sketch of splitting with zero padding follows.
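A minimal sketch of fixed-length splitting at the waveform level, with zero padding as one simple way to extend a short final segment (frequency warping, in the next tip, is another); `split_fixed` and the 3-second segment length are assumptions of this sketch:

```python
import numpy as np

def split_fixed(y, sr, seg_seconds=3.0):
    """Split a 1-D waveform into fixed-length segments, zero-padding
    the final segment up to the full length."""
    seg_len = int(seg_seconds * sr)
    segments = []
    for start in range(0, len(y), seg_len):
        seg = y[start:start + seg_len]
        if len(seg) < seg_len:  # extend the short tail by zero padding
            seg = np.pad(seg, (0, seg_len - len(seg)))
        segments.append(seg)
    return segments
```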

You could also warp the frequency axis of the spectrogram to augment the shorter utterances. This kind of data augmentation has been shown to improve ASR performance (see the linked data augmentation paper); a sketch follows.
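A minimal sketch of frequency-axis warping with a linear warp factor drawn near 1.0, in the spirit of vocal tract length perturbation; `warp_frequency` and the interpolation scheme are assumptions of this sketch, not taken from the linked paper:

```python
import numpy as np

def warp_frequency(spec, alpha):
    """Warp the frequency axis of a (freq_bins, time_frames) spectrogram
    by factor alpha (e.g. drawn uniformly from [0.9, 1.1])."""
    n_bins = spec.shape[0]
    bins = np.arange(n_bins)
    # Query each output bin k at original position k * alpha
    src = np.clip(bins * alpha, 0, n_bins - 1)
    out = np.empty_like(spec)
    for t in range(spec.shape[1]):
        # Linear interpolation of one time frame along the warped bins
        out[:, t] = np.interp(src, bins, spec[:, t])
    return out
```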

For a longer utterance containing multiple sounds, you could use music segmentation algorithms to divide it into multiple utterances. These can then be brought to a fixed length by division or augmentation, as described above; a crude sketch follows.
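As a rough stand-in for a full music segmentation algorithm, energy-based silence splitting can carve a long recording into candidate utterances. This sketch assumes librosa; the file name and the `top_db` threshold are tunable guesses:

```python
import librosa

# Hypothetical input file; "long_recording.wav" is just a placeholder
y, sr = librosa.load("long_recording.wav", sr=None)

# Find non-silent intervals; anything more than 30 dB below the peak
# is treated as silence
intervals = librosa.effects.split(y, top_db=30)

# Each interval is a (start_sample, end_sample) pair
utterances = [y[start:end] for start, end in intervals]
```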

Licensed under: CC-BY-SA with attribution