Convolutional Neural Network (CNN) for Audio [closed]

Question 1

We used deep convolutional networks on spectrograms for a spoken language identification task. We reached around 95% accuracy on a dataset provided in this TopCoder contest. The details are here.

Plain convolutional networks do not explicitly capture temporal characteristics; in this work, for example, the output of the convolutional network was fed to a time-delay neural network. But our experiments show that, even without such additional elements, convolutional networks can perform well on at least some tasks when the inputs have similar sizes.
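For readers who have not worked with this input representation, here is a minimal sketch of how a spectrogram can be computed from raw samples. It is not the code we used in the contest: the function names, the frame length, the hop size and the test tone are all assumptions chosen for illustration, and the naive DFT stands in for the FFT routine a real pipeline would use.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        # Naive O(n^2) DFT, kept only to stay dependency-free.
        row = [
            abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                    for t in range(frame_len)))
            for k in range(n_bins)
        ]
        rows.append(row)
    return rows


# Hypothetical test signal: a 128 Hz tone sampled at 1024 Hz.
# With 64-sample frames each bin is 16 Hz wide, so the tone sits on bin 8.
sample_rate = 1024
tone = [math.sin(2 * math.pi * 128 * t / sample_rate) for t in range(256)]
spec = spectrogram(tone)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

Clips of the same length produce the same number of rows, which is why fixed-size inputs are convenient: the resulting time-by-frequency grid can be handed to a convolutional network much like an image.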

Question 2

There are many techniques for extracting feature vectors from audio data in order to train classifiers. The most commonly used is MFCC (Mel-frequency cepstral coefficients), which you can think of as an "improved" spectrogram that retains more of the information relevant for discriminating between classes. Another commonly used technique is PLP (Perceptual Linear Predictive), which also gives good results. There are also many other, lesser-known techniques.
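To make the "improved spectrogram" idea concrete, here is a rough sketch of the core MFCC steps for a single frame: power spectrum, triangular mel filterbank, log, then a DCT. It is a simplified illustration, not the implementation of any particular toolkit. It leaves out steps that real front ends usually include (pre-emphasis, windowing, liftering), and the function names, the 20 filters and the 13 coefficients are assumptions.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of one frame (bins 0 .. N/2)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2 / n
        for k in range(n // 2 + 1)
    ]


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, centre, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= centre:
                row.append((f - lo) / (centre - lo))  # rising slope
            elif centre < f < hi:
                row.append((hi - f) / (hi - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=13):
    power = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # Log mel-band energies; the small constant avoids log(0) on silence.
    log_energies = [math.log(sum(w * p for w, p in zip(row, power)) + 1e-10)
                    for row in bank]
    # Unnormalised DCT-II of the log energies; keep the first n_coeffs values.
    return [
        sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
            for j, e in enumerate(log_energies))
        for c in range(n_coeffs)
    ]


# Hypothetical input: one 256-sample frame of a 440 Hz tone at 8 kHz.
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
coeffs = mfcc(frame, 8000)
```

Running this over every frame of a recording gives one short vector per frame, which is the kind of input a classifier is then trained on.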

More recently, deep networks have been used to extract feature vectors by themselves, much as is done in image recognition. This is an active area of research. Not long ago we also used hand-crafted feature extractors to train classifiers for images (SIFT, HOG, etc.), but these were replaced by deep learning techniques, which take raw images as input and extract feature vectors by themselves (indeed, that is what deep learning is really all about).

It's also very important to notice that audio data is sequential. After training a classifier you need to train a sequential model such as an HMM or CRF, which chooses the most likely sequence of speech units, using the probabilities given by your classifier as input.
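As an illustration of what the sequential model adds, here is a minimal Viterbi decoding sketch for an HMM. It assumes the transition and initial probabilities are already known and, as a simplification, uses the classifier's per-frame probabilities directly as emission scores. The two states and all the numbers are made up for the example.

```python
import math


def _log(x):
    return math.log(x) if x > 0 else float("-inf")


def viterbi(frame_probs, transition, initial):
    """Most likely state sequence given per-frame classifier probabilities.

    frame_probs[t][s]: classifier probability of state s at frame t
    transition[p][s]:  probability of moving from state p to state s
    initial[s]:        probability of starting in state s
    """
    n_states = len(initial)
    score = [_log(initial[s]) + _log(frame_probs[0][s]) for s in range(n_states)]
    back = []  # back[t - 1][s] = best predecessor of state s at frame t
    for probs in frame_probs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: prev[p] + _log(transition[p][s]))
            pointers.append(best)
            score.append(prev[best] + _log(transition[best][s]) + _log(probs[s]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical classifier output for 7 frames and 2 speech units.
frame_probs = [[0.8, 0.2], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
               [0.2, 0.8], [0.2, 0.8], [0.1, 0.9]]
transition = [[0.9, 0.1], [0.1, 0.9]]  # units tend to persist across frames
initial = [0.5, 0.5]

framewise = [max(range(2), key=row.__getitem__) for row in frame_probs]
decoded = viterbi(frame_probs, transition, initial)
print(framewise)  # [0, 0, 1, 0, 1, 1, 1]
print(decoded)    # [0, 0, 0, 0, 1, 1, 1]
```

Picking the best class frame by frame flips to unit 1 for a single frame, while the decoded sequence smooths that glitch away because a one-frame switch is unlikely under the transition model.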

A good starting point for learning speech recognition is Jurafsky and Martin: Speech and Language Processing. It explains all of these concepts very well.

[EDIT: adding some potentially useful information]

There are many speech recognition toolkits with modules that extract MFCC feature vectors from audio files, but using them for this purpose is not always straightforward. I'm currently using CMU Sphinx4. It has a class named FeatureFileDumper that can be used standalone to generate MFCC vectors from audio files.