Question

Is it possible to create a neural network that gives a consistent output when the input vectors can have different lengths?

I have sampled a lot of audio files of different lengths, and need to train a neural network that gives me the desired output for a given input. I am trying to create a regression network that generates MFCC features from the samples of an audio file. Because the files differ in length, the number of inputs differs too.


Solution

Yes, this is possible by feeding the audio as a sequence into a Recurrent Neural Network (RNN). You can train an RNN against a target that is correct at the end of a sequence, or even to predict another sequence offset from the input.
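
As a rough illustration of why sequence length stops mattering, here is a minimal sketch of a single-layer (Elman-style) recurrent network in plain Python. It is not a trained model: the weights are random, and the names `make_rnn` and `rnn_forward`, the layer sizes and the sine-wave "clips" are all assumptions made up for this example. The point is only that the same weights are reused at every time step, so the output size is fixed regardless of how long the input is.

```python
import math
import random

def make_rnn(input_size, hidden_size, output_size, seed=0):
    """Create random (untrained) weights for a single recurrent layer."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_xh": mat(hidden_size, input_size),   # input -> hidden
        "W_hh": mat(hidden_size, hidden_size),  # hidden -> hidden (the recurrence)
        "W_hy": mat(output_size, hidden_size),  # hidden -> output
    }

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_forward(params, sequence):
    """Consume a sequence of input vectors and return the output at the last step."""
    hidden = [0.0] * len(params["W_hh"])
    for x in sequence:
        from_input = matvec(params["W_xh"], x)
        from_state = matvec(params["W_hh"], hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
    return matvec(params["W_hy"], hidden)

params = make_rnn(input_size=1, hidden_size=8, output_size=3)

# Two hypothetical "audio clips" of different lengths, one sample per time step.
short_clip = [[math.sin(0.1 * t)] for t in range(50)]
long_clip = [[math.sin(0.1 * t)] for t in range(400)]

# Both calls return a vector of length 3, whatever the input length.
print(len(rnn_forward(params, short_clip)), len(rnn_forward(params, long_clip)))
```

Returning the output at every step instead of only the last one would give the sequence-to-sequence variant mentioned above.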

Do note, however, that there is a bit to learn about the options that go into constructing and training an RNN, which you will not have studied while looking at simpler layered feed-forward networks. Modern RNNs use layer designs that include memory gates. The two most popular architectures are LSTM and GRU, and these add more trainable parameters to each layer, because the memory gates need to learn weights in addition to the weights between and within the layer.
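
To give a feel for the extra parameters, the sketch below counts the weights in one recurrent layer under the textbook formulation, where each gate (or candidate) block has its own input weights, recurrent weights and a single bias vector. The function name `recurrent_layer_params` is hypothetical, and the sizes (13 inputs, 64 hidden units) are arbitrary assumptions. Real libraries may report slightly different totals, for example when they keep separate bias vectors.

```python
def recurrent_layer_params(input_size, hidden_size, gate_blocks):
    """Count trainable weights in one recurrent layer.

    Each block has input weights, recurrent weights and one bias vector.
    gate_blocks: 1 for a plain RNN, 3 for a GRU, 4 for an LSTM.
    """
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return gate_blocks * per_block

# Assumed sizes for illustration: 13 input features, 64 hidden units.
for name, blocks in [("simple RNN", 1), ("GRU", 3), ("LSTM", 4)]:
    print(name, recurrent_layer_params(13, 64, blocks))
# simple RNN 4992
# GRU 14976
# LSTM 19968
```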

RNNs are used extensively to predict from audio sequences that have already been processed into MFCC or similar feature sets, because they can handle sequenced data as input and/or output. This is a desirable feature when dealing with variable-length data such as spoken word or music.

Some other things worth noting:

  • RNNs can work well for sequences of data that are variable length, and where there is a well-defined dimension over which the sequences evolve. But they are less well adapted for variable-sized sets of features where there is no clear order or sequence.

  • RNNs can get state-of-the-art results for signal processing, NLP and related tasks, but only when there is a very large amount of training data. Other, simpler, models can work just as well or better if there is less data.

  • For the specific problem of generating MFCCs from raw audio samples: while it should be possible to create an RNN that predicts MFCC features from raw audio, this might take some effort and experimentation to get right, and could take a lot of processing power to make an RNN powerful enough to cope with very long sequences at normal audio sample rates. By contrast, creating MFCCs from raw audio using the standard approach, starting with an FFT, will be a lot simpler and is guaranteed to be accurate.
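
For reference, the standard (non-learned) route from the last point can be sketched in a few steps: split the audio into overlapping frames, window each frame, take the power spectrum, apply a mel filterbank, take logs, and finish with a DCT. The version below is a simplified, standard-library-only sketch, not a reference implementation. It uses a naive DFT in place of a real FFT, and the frame length, hop size, filter count, coefficient count and test tone are all assumptions chosen for illustration. In practice an audio library such as librosa would normally be used instead.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Hamming-window a frame, then take a naive O(N^2) DFT (stand-in for an FFT)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mel_points = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, centre):      # rising edge of the triangle
            row[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge of the triangle
            row[k] = (right - k) / (right - centre)
        bank.append(row)
    return bank

def dct2(values, n_coeffs):
    """Unnormalised DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_coeffs)]

def mfcc(samples, sample_rate, frame_len=256, hop=128, n_filters=20, n_coeffs=13):
    """Return one list of n_coeffs values per frame."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        spec = power_spectrum(samples[start:start + frame_len])
        energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        log_energies = [math.log(e + 1e-10) for e in energies]  # epsilon avoids log(0)
        features.append(dct2(log_energies, n_coeffs))
    return features

sample_rate = 8000  # assumed; a 440 Hz test tone stands in for real audio
for n_samples in (1024, 2048):
    tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(n_samples)]
    frames = mfcc(tone, sample_rate)
    print(n_samples, "samples ->", len(frames), "frames of", len(frames[0]), "coefficients")
```

Note that a longer clip simply yields more frames, each with the same number of coefficients. That sequence of fixed-size frames is far shorter than the raw sample sequence, which is one reason RNNs are usually fed MFCC frames rather than raw samples.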

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange