Question

I have implemented Demetri's Pitch Detector project for the iPhone and am hitting up against two problems: 1) any sort of background noise sends the frequency reading bananas, and 2) lower-frequency sounds aren't being pitched correctly. I tried to tune my guitar, and while the higher strings worked, the tuner could not correctly discern the low E.

The Pitch Detection code is located in RIOInterface.mm and goes something like this ...

// get the data
AudioUnitRender(...);

// convert int16 to float
Convert(...);

// pack the interleaved signal into even-odd (split complex) form for the real FFT
vDSP_ctoz((COMPLEX*)outputBuffer, 2, &A, 1, nOver2);

// apply the fft
vDSP_fft_zrip(fftSetup, &A, stride, log2n, FFT_FORWARD);

// convert the split complex FFT result back to an interleaved buffer
vDSP_ztoc(&A, 1, (COMPLEX *)outputBuffer, 2, nOver2);
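
(For context, the snippet above assumes an FFT setup roughly like the following sketch; the block size, allocation and use of DSPSplitComplex are my assumptions for illustration, not code from Demetri's project.)

#import <Accelerate/Accelerate.h>

// Assumed setup for the snippet above (names follow the snippet where possible)
uint32_t maxFrames = 2048;                        // capture block size (assumed)
uint32_t log2n     = (uint32_t)log2f(maxFrames);  // 11 for a 2048-frame block
uint32_t n         = 1 << log2n;
uint32_t nOver2    = n / 2;
int      stride    = 1;
FFTSetup fftSetup  = vDSP_create_fftsetup(log2n, FFT_RADIX2);

DSPSplitComplex A;                                // the project's COMPLEX_SPLIT type
A.realp = (float *)malloc(nOver2 * sizeof(float));
A.imagp = (float *)malloc(nOver2 * sizeof(float));

float *outputBuffer = (float *)malloc(n * sizeof(float));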

Demetri then goes on to determine the 'dominant' frequency as follows:

float dominantFrequency = 0;
int bin = -1;
for (int i=0; i<n; i+=2) {
    float curFreq = MagnitudeSquared(outputBuffer[i], outputBuffer[i+1]);
    if (curFreq > dominantFrequency) {
        dominantFrequency = curFreq;
        bin = (i+1)/2;
    }
}
memset(outputBuffer, 0, n*sizeof(SInt16));

// Update the UI with our newly acquired frequency value.
[THIS->listener frequencyChangedWithValue:bin*(THIS->sampleRate/bufferCapacity)];

To start with, I believe I need to apply a LOW PASS FILTER ... but I'm not an FFT expert and I'm not sure exactly where or how to do that against the data returned from the vDSP functions. I'm also not sure how to improve the accuracy of the code in the lower frequencies. There seem to be other algorithms to determine the dominant frequency, but again, I'm looking for a kick in the right direction when using the data returned by Apple's Accelerate framework.

UPDATE:

The Accelerate framework actually has some windowing functions. I set up a basic Hann window like this:

windowSize = maxFrames;
transferBuffer = (float*)malloc(sizeof(float)*windowSize);
window = (float*)malloc(sizeof(float)*windowSize);
memset(window, 0, sizeof(float)*windowSize);
vDSP_hann_window(window, windowSize, vDSP_HANN_NORM); 

which I then apply by inserting

vDSP_vmul(outputBuffer, 1, window, 1, transferBuffer, 1, windowSize); 

before the vDSP_ctoz function. I then changed the rest of the code to use 'transferBuffer' instead of 'outputBuffer' ... but so far, I haven't noticed any dramatic change in the final pitch guess.


Solution

Pitch is not the same as the peak-magnitude frequency bin (which is what the FFT in the Accelerate framework gives you most directly), so a simple peak-frequency detector will not be reliable for pitch estimation. A low-pass filter will not help when the note has a missing or very weak fundamental (common in some voice, piano and guitar sounds) and/or lots of powerful overtones in its spectrum.

Look at a wide-band spectrum or spectrograph of your musical sounds and you will see the problem.

Other methods are usually needed for a more reliable estimate of musical pitch. These include autocorrelation and related lag-domain methods (AMDF, ASDF), cepstral analysis, the harmonic product spectrum, phase-vocoder techniques, and composite algorithms such as RAPT (Robust Algorithm for Pitch Tracking) and YAAPT (Yet Another Algorithm for Pitch Tracking). An FFT is useful only as a sub-component of some of the above methods.
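
As a rough illustration of the autocorrelation family (this sketch is mine, not part of the original answer), a naive time-domain estimator in C might look like the following. It assumes a block of mono float samples and a known sample rate; the minFreq/maxFreq search range is a placeholder, and a real implementation would normalize the correlation and add interpolation and voicing checks.

// Naive autocorrelation pitch estimate (illustrative sketch only).
// samples:    block of mono float audio (windowed is fine)
// numSamples: block length
// sampleRate: e.g. 44100.0f
// Returns an estimated fundamental in Hz, or 0 if nothing was found.
static float EstimatePitchAutocorrelation(const float *samples, int numSamples,
                                          float sampleRate)
{
    const float minFreq = 60.0f;    // assumed search range, covers low E (~82 Hz)
    const float maxFreq = 1000.0f;
    int minLag = (int)(sampleRate / maxFreq);
    int maxLag = (int)(sampleRate / minFreq);
    if (maxLag >= numSamples) maxLag = numSamples - 1;

    float bestCorr = 0.0f;
    int   bestLag  = 0;
    for (int lag = minLag; lag <= maxLag; lag++) {
        float corr = 0.0f;
        for (int i = 0; i + lag < numSamples; i++) {
            corr += samples[i] * samples[i + lag];
        }
        if (corr > bestCorr) {
            bestCorr = corr;
            bestLag  = lag;
        }
    }
    return (bestLag > 0) ? sampleRate / (float)bestLag : 0.0f;
}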

OTHER TIPS

At the very least you need to apply a window function to your time domain data, prior to calculating the FFT. Without this step the power spectrum will contain artefacts (see: spectral leakage) which will interfere with your attempts at extracting pitch information.

A simple Hann (aka Hanning) window should suffice.

What is your sample frequency and blocksize? Low E is around 80 Hz, so you need to make sure your capture block is long enough to capture many cycles at this frequency. This is because the Fourier Transform divides the frequency spectrum into bins, each several Hz wide. If you sample at 44.1 kHz and have a 1024 point time domain sample, for instance, each bin will be 44100/1024 = 43.07 Hz wide. Thus a low E would be in the second bin. For a bunch of reasons (to do with spectral leakage and the nature of finite time blocks), practically speaking you should consider the first 3 or 4 bins of data in an FFT result with extreme suspicion.

If you drop the sample rate to 8 kHz, the same blocksize gives you bins that are 7.8125 Hz wide. Now low E will be in the 10th or 11th bin, which is much better. You could also use a longer blocksize.
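
To make that arithmetic concrete, here is a tiny standalone sketch of the bin-width and bin-index calculation using the numbers quoted above (the program itself is just illustrative):

#include <stdio.h>

// Frequency resolution of an N-point FFT at a given sample rate, and the
// bin a low-E fundamental (~82.4 Hz) falls into, ignoring leakage.
int main(void)
{
    float blockSize = 1024.0f;
    float lowE      = 82.4f;

    float binWidth = 44100.0f / blockSize;              // 43.07 Hz per bin
    printf("44.1 kHz: %.2f Hz per bin, low E near bin %d\n",
           binWidth, (int)(lowE / binWidth + 0.5f));    // roughly bin 2

    binWidth = 8000.0f / blockSize;                     // 7.8125 Hz per bin
    printf("8 kHz:    %.4f Hz per bin, low E near bin %d\n",
           binWidth, (int)(lowE / binWidth + 0.5f));    // roughly bin 10 or 11
    return 0;
}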

And as Paul R points out, you MUST use a window to reduce spectral leakage.

The frequency response function of the iPhone drops off below 100 - 200 Hz (see http://blog.faberacoustical.com/2009/ios/iphone/iphone-microphone-frequency-response-comparison/ for an example).

If you are trying to detect the fundamental mode of a low guitar string, the microphone might be acting as a filter and suppressing the very frequency you are interested in. There are a couple of options if you are interested in using the FFT data you can get. One is to window the data in the frequency domain around the note you are trying to detect, so that all you can see is the fundamental even if it is of lower magnitude than the higher modes (e.g. have a per-string toggle, so that tuning the low E puts the tuner into this mode).
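
A minimal sketch of that frequency-domain windowing idea, assuming you already have per-bin magnitudes and know which note you are hunting for (the helper name and tolerance are my own placeholders):

// Keep only the bins near an expected fundamental; zero everything else.
// magnitudes:  per-bin magnitude (or magnitude-squared) values, numBins long
// binWidth:    sampleRate / blockSize, in Hz
// targetHz:    the note being tuned (e.g. 82.4f for low E)
// toleranceHz: half-width of the band to keep (assumed value, e.g. 20.0f)
static void KeepBandAroundNote(float *magnitudes, int numBins, float binWidth,
                               float targetHz, float toleranceHz)
{
    for (int i = 0; i < numBins; i++) {
        float binFreq = i * binWidth;
        if (binFreq < targetHz - toleranceHz || binFreq > targetHz + toleranceHz) {
            magnitudes[i] = 0.0f;
        }
    }
}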

Or you can low-pass filter the sound data. You can do this either in the time domain or, even easier since you already have frequency-domain data, in the frequency domain. A very simple time-domain low-pass filter is a moving-average filter. A very simple frequency-domain low-pass filter is to multiply your FFT magnitudes by a vector with 1s in the low-frequency range and a linear (or even a step) ramp down to 0 in the higher frequencies.
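
For example, a frequency-domain low-pass along those lines might look like the following sketch (the cutoff and ramp length are arbitrary placeholders, not values from the answer):

// Multiply per-bin magnitudes by a mask that is 1 up to cutoffBin and
// ramps linearly down to 0 over the next rampBins bins.
static void FrequencyDomainLowPass(float *magnitudes, int numBins,
                                   int cutoffBin, int rampBins)
{
    for (int i = 0; i < numBins; i++) {
        float gain;
        if (i <= cutoffBin) {
            gain = 1.0f;                                     // pass band
        } else if (i <= cutoffBin + rampBins) {
            gain = 1.0f - (float)(i - cutoffBin) / rampBins; // linear ramp
        } else {
            gain = 0.0f;                                     // stop band
        }
        magnitudes[i] *= gain;
    }
}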

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow