Question

This is probably a very silly question, but I couldn't find the details anywhere.

So I have an audio recording (wav file) that is 3 seconds long. That is my sample and it needs to be classified as [class_A] or [class_B].

Following a tutorial on MFCC, I divided the sample into frames (291 frames, to be exact) and computed the MFCCs of each frame.

Now I have 291 feature vectors, the length of each vector is 13.

My question is: how exactly do you use those vectors with a classifier (k-NN, for example)? I have 291 vectors that represent 1 sample. I know how to work with 1 vector per sample, but I don't know what to do when I have 291 of them. I couldn't really find an explanation anywhere.
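For reference, this is a minimal sketch of the setup described above, assuming librosa is used for the MFCC extraction (the file name is just a placeholder):

```python
import librosa

# Load the 3-second recording; sr=None keeps the file's own sample rate.
y, sr = librosa.load("sample.wav", sr=None)

# Compute 13 MFCCs per frame; the result has shape (13, n_frames),
# so each column is one of the ~291 feature vectors mentioned above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)
```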


Solution

Each of your vectors represents the spectral characteristics of your audio file at one frame, so the sequence of vectors describes how the spectrum varies over time. Depending on the length of your frames, you might want to group some of them (for example by averaging each dimension) to match the time resolution you want the classifier to work at. As an example, think of a particular sound whose envelope has an attack time of 2 ms: that may be as fine-grained as you want your time quantization to get, so you could either a) group and average the MFCC vectors that span 2 ms, or b) recompute the MFCCs with the desired time resolution.
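A minimal sketch of option (a) with NumPy, grouping and averaging consecutive frames (the block size of 10 frames is an assumption; you would pick it from your frame length and the time resolution you want):

```python
import numpy as np

def block_average(mfcc, frames_per_block):
    """Average consecutive MFCC frames into coarser blocks.

    mfcc: array of shape (n_coeffs, n_frames), e.g. (13, 291).
    Returns an array of shape (n_coeffs, n_blocks).
    """
    n_coeffs, n_frames = mfcc.shape
    n_blocks = n_frames // frames_per_block
    trimmed = mfcc[:, :n_blocks * frames_per_block]              # drop leftover frames
    blocks = trimmed.reshape(n_coeffs, n_blocks, frames_per_block)
    return blocks.mean(axis=2)                                   # average within each block

coarse = block_average(mfcc, frames_per_block=10)                # 291 frames -> 29 blocks
```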

If you really want to keep the resolution that fine, you can concatenate the 291 vectors and treat the result as a single vector of 291 × 13 = 3,783 dimensions, which will probably require a huge dataset to train on.
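Here is a sketch of that flattening approach with scikit-learn's k-NN. The variables train_mfccs, train_labels and new_mfcc are hypothetical, and every recording must produce the same number of frames for the shapes to line up:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One row per recording: each (13, 291) MFCC matrix flattened to a 3,783-dim vector.
X_train = np.stack([m.flatten() for m in train_mfccs])
y_train = np.array(train_labels)          # e.g. 0 = class_A, 1 = class_B

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Classify a new 3-second recording after computing its MFCCs the same way.
prediction = knn.predict(new_mfcc.flatten().reshape(1, -1))
```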

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow