if there is a third option I'm overlooking
Yes: do both at the same time, reducing the FFT size as well as using a larger step size. In a comment you pointed out that you want to detect "sniffling/chewing with mouth". So what you want to do is similar to the typical task of speech recognition. There, you typically extract a feature vector in steps of 10 ms (with Fs = 44.1 kHz, that is every 441 samples), and the signal window to transform is roughly double the step size, so 20 ms (882 samples). Rounded up to the next power of 2, this gives an FFT size of 1024 samples (make sure that you choose an FFT size which is a power of 2, because it is faster).
Any increase in window size or reduction in step size increases the data but mainly adds redundancy.
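To make the arithmetic above concrete, here is a minimal sketch of how the step size, window size and FFT size could be derived from the sample rate. The function names (next_power_of_two, frame_params, frame_starts) are made up for illustration and are not taken from any library.

```python
def next_power_of_two(n):
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def frame_params(fs, step_ms=10.0, window_ms=20.0):
    """Return (step, window, nfft) in samples for a given sample rate in Hz."""
    step = round(fs * step_ms / 1000.0)      # hop between successive frames
    window = round(fs * window_ms / 1000.0)  # samples analysed per frame
    nfft = next_power_of_two(window)         # window is zero-padded to this size
    return step, window, nfft


def frame_starts(num_samples, step, window):
    """Start indices of all windows that fit completely into the signal."""
    return list(range(0, num_samples - window + 1, step))


step, window, nfft = frame_params(44100)
print(step, window, nfft)  # 441 882 1024

# One second of audio yields about 100 feature vectors.
print(len(frame_starts(44100, step, window)))  # 99
```

With these values you compute roughly 100 FFTs of size 1024 per second of audio, which is the usual budget in speech recognition front ends.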
Additional hints:
@SztupY correctly pointed out that you need to "window" your signal prior to the FFT, typically with a Hamming window. (But this is not "filtering". It is just multiplying each sample value with the corresponding window value, without accumulating the result.)
The raw FFT output is hardly suited to recognizing "sniffling/chewing with mouth". A classical recognizer consists of HMMs or ANNs which process sequences of MFCCs and their deltas.
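The windowing hint above can be sketched as follows. This is only an illustration in plain Python, assuming the common Hamming coefficients 0.54 and 0.46. The helper names hamming and window_and_pad are hypothetical, and the zero-padding to the FFT size is my assumption about how an 882-sample window would be fed into a 1024-point FFT.

```python
import math


def hamming(n):
    """Hamming window of length n: w[i] = 0.54 - 0.46 * cos(2*pi*i / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def window_and_pad(frame, nfft):
    """Multiply each sample by its window value, then zero-pad to nfft samples.

    This is a plain element-wise product: nothing is summed up,
    so it is not a filtering (convolution) operation.
    """
    w = hamming(len(frame))
    windowed = [s * wi for s, wi in zip(frame, w)]
    return windowed + [0.0] * (nfft - len(windowed))


# Example: a constant frame of 882 samples, prepared for a 1024-point FFT.
frame = [1.0] * 882
fft_input = window_and_pad(frame, 1024)
print(len(fft_input))  # 1024
```

The result is what you would hand to your FFT routine. The window tapers the frame edges towards about 0.08 instead of cutting the signal off abruptly.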
Could the performance I'm currently getting just be the best I'm going to get? Or does it seem like I must be doing something stupid because much faster speeds are possible?
It's close to the best, but you are wasting all the CPU power on estimating highly redundant data, leaving no CPU power for the recognizer.
Is my approach to this at least fundamentally correct or am I barking entirely up the wrong tree?
After considering my answer you might re-think your approach.