FFT and inverse of FFT

Question 1

If you are working on any mass-market computing platform (Intel with Windows or OS X, iOS, et cetera), then there is a high-performance FFT implementation supplied by the vendor or manufacturer.

Otherwise, you should evaluate FFTW.

Writing high-performance FFTs for sizes other than powers of two is a complicated task.

If you are going to use your own implementation, then, regarding just the power-of-two sizes:

The implementation you show computes sqrt during the FFT. Most high-performance FFT implementations compute constants ahead of time and store them in a table.

The scaling contains division operations, in x[i] /= n and y[i] /= n. The compiler is likely to implement these as divide instructions. Division is typically a slow instruction on common processors. It would be better to compute scale = 1. / n once and multiply by scale instead of dividing by n.

Better yet would be to omit the scale entirely. The transform is often useful without the scale, or the scale can be omitted from individual transforms and applied just once as an aggregated scale. (E.g., instead of doing two scale operations, one in the forward FFT and one in the inverse FFT, leave the scale operation out of the FFT routines and do it just once after both the forward FFT and the inverse FFT.)

You might be able to omit the bit-reversal permutation, if it is acceptable to have the frequency-domain data in bit-reversed order.

If you keep the bit-reversal permutation, it can be optimized. Techniques for doing this are platform-dependent. Some platforms have an instruction for reversing the bits in an integer (e.g., ARM has rbit). If your platform does not, you might want to keep the bit-reversal indices in a table or investigate ways of computing them more quickly than your current code.

If you keep both the bit-reversal permutation and the scaling, you should consider doing them at the same time. The permutation uses a lot of memory motion, and the scaling uses the processor’s arithmetic unit. Most modern processors can do both of these at once, so you would get some benefit from overlapping operations.

Your current code uses a radix-2 butterfly. Radix-4 is usually better, because it gains some advantage from the fact that multiplication by i can be done just by changing which piece of data is used and changing some additions to subtractions and vice-versa.

If your array length approaches the size of level-one memory cache on your processor, parts of the FFT implementation will thrash cache and slow down significantly unless you design appropriate code to deal with this (often by copying parts of the array into a buffer temporarily).

If your target processor has SIMD features, you should absolutely use those in the FFT; they speed up FFT performance tremendously.

The above should convey to you that writing an efficient FFT is a complicated task. Unless you want to expend significant effort on it, you are likely better off using FFTW or other existing implementations.

Question 2

In your implementation, I'm concerned by this bit of code:

     z =  u1 * c1 - u2 * c2;
     u2 = u1 * c2 + u2 * c1;
     u1 = z;

(u1, u2) is subject to lots and lots of accumulated roundoff error when l1 is large. You will likely get an inaccurate transform.

I'd echo the recommendation of FFTW, but I believe it's highly platform-specific. (Most FFT libraries are.) [EDIT: No, it's actually straight C89. It's exactly what you say you're looking for.]

The Wikipedia FFT page lists a bunch of algorithms that work for weird-sized input arrays. I don't know how they work, but I believe the general idea is that you use Rader's algorithm or Bluestein's algorithm for prime-sized inputs and Cooley-Tukey to reduce composite-sized transforms to a bunch of prime-sized transforms.

Question 3

For an alternative to FFTW check out my mix-radix FFT, which also handles non-2^N FFT lengths. The C-source is available from http://www.corix.dk/Mix-FFT/mix-fft.html.