There are only two things that I can think of that clearly distinguish "Music" from all other Audio/sounds:
Meter: Virtually all composed music has a meter. In theory this should be detectable with an FFT, but over a frequency range of approx. 0.25 Hz to 10 Hz (instead of the usual 20 Hz to 20 kHz). In practice? I don't know, but it seems worth a try.
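Here is a rough sketch of what the meter idea might look like. It assumes you have already reduced the audio to a loudness envelope (one value per short frame), which is my own assumption about how you would get down to the 0.25 Hz to 10 Hz range. The names `tempo_spectrum` and `strongest_pulse` are made up for illustration, and it uses a plain direct DFT over the slow frequencies rather than a real FFT, just to stay dependency-free. The test envelope is synthetic, not real music.

```python
import math

def tempo_spectrum(envelope, frame_rate, f_lo=0.25, f_hi=10.0, step=0.05):
    """Return (frequency_hz, magnitude) pairs for slow periodicities in the envelope.

    Direct DFT evaluated only at the frequencies of interest (not an FFT).
    """
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [x - mean for x in envelope]  # remove the DC offset
    steps = int(round((f_hi - f_lo) / step))
    spectrum = []
    for k in range(steps + 1):
        f = f_lo + k * step
        w = 2.0 * math.pi * f / frame_rate  # radians per envelope frame
        re = sum(x * math.cos(w * i) for i, x in enumerate(centered))
        im = sum(x * math.sin(w * i) for i, x in enumerate(centered))
        spectrum.append((f, math.hypot(re, im) / n))
    return spectrum

def strongest_pulse(envelope, frame_rate):
    """Frequency (Hz) of the largest peak in the 0.25 Hz to 10 Hz band."""
    return max(tempo_spectrum(envelope, frame_rate), key=lambda p: p[1])[0]

# Synthetic envelope: a short loud burst every 0.5 s (2 Hz), 50 frames/s, 20 s long.
frame_rate = 50
envelope = [1.0 if (i % 25) < 3 else 0.1 for i in range(frame_rate * 20)]

pulse_hz = strongest_pulse(envelope, frame_rate)
print(f"strongest pulse: {pulse_hz:.2f} Hz ({pulse_hz * 60:.0f} beats per minute)")
```

On real material the beat's own multiples (4 Hz, 6 Hz, ... for a 2 Hz pulse) will show up as well, so picking the single largest peak is only a starting point.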
Tuning: Something common to almost all professional music, including the voices of professional singers (when they are musically accompanied), but not to any other sounds, is that the notes will all be in the same "tuning" of a 12-tone equal-tempered scale. In other words, the ratios between their frequencies will always be exact integer powers of 2^(1/12). Once the tuning is established, the notes will never fall in the gaps between these steps. Normal sounds, including human voices, are spread all over the spectrum, but music is almost always within +/- 10 cents of a scale note.
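To make the "+/- 10 cents" test concrete, here is a minimal sketch. It assumes the tuning reference is already known (A = 440 Hz here, which is only a placeholder, since the reference has to be established per piece) and that you already have a list of spectral peak frequencies. The function names `cents_off_grid` and `looks_tuned` are hypothetical.

```python
import math

def cents_off_grid(freq_hz, ref_hz=440.0):
    """Signed distance in cents from freq_hz to the nearest equal-tempered step.

    ref_hz is an assumed tuning reference; any note of the scale works.
    """
    cents = 1200.0 * math.log2(freq_hz / ref_hz)  # 100 cents per semitone
    return (cents + 50.0) % 100.0 - 50.0          # fold into [-50, +50)

def looks_tuned(peaks_hz, ref_hz=440.0, tolerance_cents=10.0):
    """True if every peak lies within the tolerance of a scale step."""
    return all(abs(cents_off_grid(f, ref_hz)) <= tolerance_cents for f in peaks_hz)

c_major_chord = [261.63, 329.63, 392.00]  # C4, E4, G4 in A=440 equal temperament
arbitrary = [270.0, 340.0, 415.0]         # frequencies picked off the grid

print(looks_tuned(c_major_chord))
print(looks_tuned(arbitrary))
print(round(cents_off_grid(270.0), 1))
```

The folding step is what makes the check independent of octave and note name: only the distance to the nearest step matters.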
Method #1 is iffy; I don't know if anyone has ever tried it.
But #2 is definite: you can actually see this with an audio spectrum analyzer, provided the FFT has very high frequency resolution (at least 36 divisions per octave). There are some catches, though, such as:
- Differentiating between the music and other simultaneous sound/noise
- Stringed instruments, like guitars and violins, which often "bend" notes out of tune
- Trombones and unaccompanied human voices, which can "slide" between notes, or use just intonation instead of equal temperament for chords.
- Programmatically establishing what the "tuning" is at different places in the film (it's not necessarily absolute, just stable within any one piece of music)
- Harmonics: musical notes are usually more than simple sine waves, which means that there are a lot of harmonic frequencies mixed in there. Harmonics aren't exponential like scales; they are integer multiples of the base note, so most of them don't line up with the scale steps. Fortunately, harmonics are almost always of lower amplitude than the base notes, so it should be possible to just "look for the peaks".
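On the harmonics catch, a quick calculation shows how far each integer multiple lands from the equal-tempered grid. This is just arithmetic on the definitions above (harmonic n sits 1200*log2(n) cents above the base note); `harmonic_offsets` is a hypothetical helper name.

```python
import math

def harmonic_offsets(max_harmonic=8):
    """For harmonics 1..max_harmonic, return (n, nearest_semitone, offset_cents).

    offset_cents is how far harmonic n lies from the nearest equal-tempered
    step above the base note (negative means flat of the step).
    """
    rows = []
    for n in range(1, max_harmonic + 1):
        cents = 1200.0 * math.log2(n)
        nearest = round(cents / 100.0)
        rows.append((n, nearest, cents - nearest * 100.0))
    return rows

offsets = harmonic_offsets()
for n, semitones, off in offsets:
    print(f"harmonic {n}: {semitones:2d} semitones up, {off:+6.2f} cents off the grid")
```

The octave harmonics (2, 4, 8) land exactly on the grid and harmonics 3 and 6 are only about 2 cents sharp, but harmonic 5 is roughly 14 cents flat and harmonic 7 roughly 31 cents flat, so those two would fail a +/- 10 cent test unless they are filtered out by amplitude.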
Well, those are my "clever" ideas. Now it's just a small matter of implementation ... ;-)