eliminate known audio from recorded sound to estimate background sound level via deconvolution

Question 1

With vDSP_zvsub you're just doing a complex subtraction at each bin, which is probably not what you want.

It's not clear exactly what you're trying to achieve, but it sounds like you want to subtract the magnitude of one spectrum from the other, in which case you would need to do the following:

convert each complex frequency domain spectrum from complex to polar (magnitude + phase)
subtract the magnitudes at each bin
convert the resulting polar data back to complex

Question 2

The length argument to vDSP_zvsub is the number of complex elements to be processed, not the logarithm of the number of elements. You should pass nOver2 rather than log2n.

This merely addresses the programming aspect. Other answers address signal processing issues. In particular, an FFT is linear: Given signals X and Y and constants a and b, FFT(a•X+b•Y) = a•FFT(X)+b•FFT(Y). The inverse FFT is also linear. Therefore, an inverse FFT of the difference of the FFTs of two signals should not give you a different result from subtracting two signals directly, except for the usual floating-point rounding errors.

Question 3

You will need the system impulse response between the audio sent to the speaker and the audio received from the mic (DAC/ADC buffering delay, anti-aliasing filter group delay, speaker and mic responses, speed of sound in air, etc.) in order to produce a (mostly) canceling signal, either in the time domain or in the frequency domain. Note that this includes matching amplitudes as well as delays, and that one set of speakers or mics may well be "out of phase" compared to others.