Question

I am trying to train a network on top of the VGGish architecture (https://github.com/tensorflow/models/tree/master/research/audioset/vggish) using transfer learning / fine-tuning.

I initially started out by fine-tuning only the embedding layers and training a simple MLP with some regularization from scratch (2 fully connected hidden layers with ReLU, BatchNorm and dropout; the final output is currently a FC layer to 4 classes). It reaches an accuracy of 1.0 on the training data.
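For context, the classifier head looks roughly like this. This is a minimal Keras sketch assuming the 128-dimensional VGGish embeddings have already been extracted and are fed in as plain feature vectors; the hidden layer sizes and dropout rate are placeholders, not my exact values:

```python
import tensorflow as tf

# Sketch of the MLP head trained on top of the 128-d VGGish embeddings.
# Hidden sizes and dropout rate are illustrative placeholders.
def build_classifier_head(embedding_dim=128, num_classes=4):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(embedding_dim,)),
        tf.keras.layers.Dense(256),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(128),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```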

The input to the VGGish network is log-mel spectrograms. My collected audio files are fairly low-quality OGG/Vorbis recordings with a sample rate of 8000 Hz. This is lower than the sample rate of the audio VGGish was trained on, so if I create the log-mel spectrogram using their code, every spectrogram is missing the upper bins (see bands 48 and up on the y-axis in the image below). Note that this happens in every image, so my hypothesis is that it shouldn't really matter; it is essentially just padding to fit the input size of the VGGish network.
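For reference, this is roughly how the empty bands can be verified. A rough sketch assuming librosa is available to read the OGG file and the `vggish_input` module from the linked repo is on the path; the file name is a placeholder:

```python
import numpy as np
import librosa

import vggish_input  # from the linked VGGish repo

# Load the 8 kHz OGG/Vorbis file at its native sample rate.
waveform, sr = librosa.load("sample.ogg", sr=None)  # placeholder file name

# waveform_to_examples resamples to 16 kHz internally and returns
# log-mel patches of shape (num_examples, 96 frames, 64 mel bands).
examples = vggish_input.waveform_to_examples(waveform, sr)

# Because the source audio has no content above 4 kHz (the Nyquist
# frequency at 8 kHz), the upper mel bands are (nearly) constant
# across all patches, i.e. they carry no information.
band_std = examples.std(axis=(0, 1))
empty = np.where(band_std < 1e-3 * band_std.max())[0]
print("bands with (nearly) constant values:", empty)
```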

Will this have any negative consequences on the network's ability to learn from my dataset? Do I also need to fine-tune the convolutional layers earlier in the network?

In this image, a log-mel spectrogram of one of my samples is shown. Note that bands 48 and up (y-axis) are empty.

Any other thoughts on how I can work around this (potential) issue? It is not feasible to increase the sample rate of the recordings, as they are being streamed from low-power devices.

I do not have enough (annotated) data to fully retrain, so training a model from scratch is not feasible.

Suggestions on other pretrained networks are also welcome!

Thanks :)
