Question

I am trying to train a network on top of the VGGish architecture (https://github.com/tensorflow/models/tree/master/research/audioset/vggish) using transfer learning / fine-tuning.

I initially started out by fine-tuning only the embedding layers and training a simple MLP with some regularization from scratch (2 fully connected hidden layers with ReLU, BatchNorm and dropout; the final output is currently a FC layer to 4 classes). It reaches an accuracy of 1.0 on the training data.
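For context, the classifier head looks roughly like this. This is a minimal Keras sketch assuming the 128-dimensional VGGish embeddings have already been extracted and are fed in as plain feature vectors; the hidden layer sizes and dropout rate are placeholders, not my exact values:

```python
import tensorflow as tf

# Sketch of the MLP head trained on top of the 128-d VGGish embeddings.
# Hidden sizes and dropout rate are illustrative placeholders.
def build_classifier_head(embedding_dim=128, num_classes=4):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(embedding_dim,)),
        tf.keras.layers.Dense(256),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(128),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```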

The input to the VGGish network is log-mel spectrograms. My collected audio files are fairly low-quality OGG/Vorbis recordings with a sample rate of 8000 Hz. This is lower than the sample rate of the audio VGGish was trained on, so if I create the log-mel spectrogram using their code, every spectrogram is missing the upper bins (see bands 48 and up on the y-axis in the image below). Note that this happens in every image, so my hypothesis is that it shouldn't really matter; it is essentially just padding to fit the input size of the VGGish network.
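For reference, this is roughly how the empty bands can be verified. A rough sketch assuming librosa is available to read the OGG file and the `vggish_input` module from the linked repo is on the path; the file name is a placeholder:

```python
import numpy as np
import librosa

import vggish_input  # from the linked VGGish repo

# Load the 8 kHz OGG/Vorbis file at its native sample rate.
waveform, sr = librosa.load("sample.ogg", sr=None)  # placeholder file name

# waveform_to_examples resamples to 16 kHz internally and returns
# log-mel patches of shape (num_examples, 96 frames, 64 mel bands).
examples = vggish_input.waveform_to_examples(waveform, sr)

# Because the source audio has no content above 4 kHz (the Nyquist
# frequency at 8 kHz), the upper mel bands are (nearly) constant
# across all patches, i.e. they carry no information.
band_std = examples.std(axis=(0, 1))
empty = np.where(band_std < 1e-3 * band_std.max())[0]
print("bands with (nearly) constant values:", empty)
```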

Will this have any negative consequences on the network's ability to learn from my dataset? Do I also need to fine-tune the convolutional layers earlier in the network?

In this image, a log-mel spectrogram of one of my samples is shown. Note that bands 48 and up (y-axis) are empty.

Any other thoughts on how I can work around this (potential) issue? It is not feasible to increase the sample rate of the recordings, as they are being streamed from low-power devices.

I do not have enough (annotated) data to fully retrain, so training a model from scratch is not feasible.

Suggestions on other pretrained networks are also welcome!

Thanks :)
