Question

So I'm trying to build an android app which acts as a real time audio analyzer as a precursor to a project that will involve detecting and filtering out certain sounds.

So I think I've got the basics of discrete Fourier transforms down, however I'm not sure what the best parameters should be for doing real time frequency analysis.

I get the impression that under ideal situations (unlimited computing power), I would take all the samples from the 44100 sample/sec PCM stream I'm getting from the AudioRecord class and put them through a 44100 element fifo "window" (padded to 2**16 with 0's and maybe a tapering function?) , running an FFT on the window every time a new sample came in. This would (I think), give me the spectrum for 0 - ~22 KHz updated 44100 times per second.

It seems like this is not going to happen on a smartphone. The thing is, I'm not sure which parameters of the computation I should reduce in order to make in order to make it tractable on my Galaxy Nexus while still holding on to as much quality as possible. Eventually I would like to be using an external microphone with better sensitivity.

I figure it will involve moving the window more than one sample between taking FFT's, but I have no idea at what point this becomes more detrimental to accuracy/aliasing/whatever than just doing the FFT on a smaller window, or if there is a third option I'm overlooking.

With the natively implemented KissFFT I'm using from libgdx, I seem to be able to do somewhere between 30-42 44100 element FFT's per 44100 samples and still have it be responsive (meaning that the buffer getting filled from the thread doing AudioRecord.read() isn't filling up faster than the thread doing the fft's can drain it).

So my questions are:

  1. Could the performance I'm currently getting just be the best I'm going to get? Or does it seem like I must be something stupid because much faster speeds are possible?
  2. Is my approach to this at least fundamentally correct or am I barking entirely up the wrong tree?

I'd be happy to show any of my code if that would help answer my questions, but there's a lot of it so I figured I would do so selectively instead of posting it all.

Was it helpful?

Solution

if there is a third option I'm overlooking

Yes: doing both at the same time, a reduction of the FFT size as well as a larger step size. In a comment you pointed out that you want to detect "sniffling/chewing with mouth". So, what you want to do is similar to the typical task of speech recognition. There, you typically extract a feature vector in steps of 10ms (meaning with Fs=44.1kHz every 441 samples) and the signal window to transform is roughly about double the size of the step size, so 20ms which yields to a 2^X FFT size of 1024 samples (make sure that you choose an FFT size which is a power of 2, because it is faster).

Any increase in window size or reduction in step size increases the data but mainly adds redundancy.

Additional hints:

  • @SztupY correctly pointed out that you need to "window" your signal prior to the FFT, typically with a Hamming-wondow. (But this is not "filtering". It is just multiplying each sample value with the corresponding window value without accumulating the result).

  • The raw FFT output is hardly suited to recognize "sniffling/chewing with mouth", a classical recognizer consists of HMMs or ANNs which process sequences of MFCCs and their deltas.

Could the performance I'm currently getting just be the best I'm going to get? Or does it seem like I must be something stupid because much faster speeds are possible?

It's close to the best, but you are wasting all the CPU power to estimate highly redundant data, leaving no CPU power to the recognizer.

Is my approach to this at least fundamentally correct or am I barking entirely up the wrong tree?

After considering my answer you might re-think your approach.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top