Each of your vectors represents the spectral characteristics of your audio at one point in time; the sequence of vectors describes how the spectrum evolves. Depending on the length of your frames, you may want to group some of them (for example, by averaging along each dimension) to match the time resolution you want the classifier to work at. As an example, consider a sound whose envelope has an attack time of 2 ms: if that is as fine-grained as you need your time quantization to be, you could either a) group and average however many MFCC vectors span 2 ms, or b) recompute the MFCCs at the desired time resolution.
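Option a) is a simple pooling operation. A minimal sketch with numpy, assuming a hypothetical 291 x 13 MFCC matrix (frames by coefficients) and an arbitrary group size of 3 frames:

```python
import numpy as np

# Hypothetical stand-in for your MFCC matrix: 291 frames x 13 coefficients.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((291, 13))

def pool_frames(frames, group_size):
    """Average consecutive frames in groups of `group_size`,
    dropping any leftover frames at the end."""
    n = (len(frames) // group_size) * group_size
    return frames[:n].reshape(-1, group_size, frames.shape[1]).mean(axis=1)

pooled = pool_frames(mfcc, 3)
print(pooled.shape)  # (97, 13): 291 frames averaged in groups of 3
```

Pick `group_size` so that one pooled vector spans the time window you care about (e.g., 2 ms divided by your hop length).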
If you really want to keep the resolution that fine, you can concatenate the 291 vectors into a single vector of 291 x 13 = 3,783 dimensions, but a classifier working in a space that large will probably need a very large dataset to train on.
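Concatenation is just a flatten. Continuing with the same hypothetical 291 x 13 matrix:

```python
import numpy as np

# Hypothetical 291 x 13 MFCC matrix flattened into one feature vector.
mfcc = np.arange(291 * 13, dtype=float).reshape(291, 13)
feature = mfcc.reshape(-1)
print(feature.shape)  # (3783,) -- one very high-dimensional sample
```

Each audio file then becomes a single point in a 3,783-dimensional space, which is why the training-set size becomes the limiting factor.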