Question

I'm trying to build a C# app that detects when music is present in a video. I can get at the audio fine, in whatever format is required. However, I have hit a brick wall in music detection.

There are loads of posts about audio fingerprinting and how to do it in C# or any other language. However, I want rough in/out times at which music occurs in a film; I'm not concerned with what the music is.

The music is unlikely to exist in any fingerprint databases, so this would likely have to be an entirely computational analysis.

Are there any clever ideas? Or am I best off implementing a beat detection algorithm, processing the audio piece by piece, and then estimating the in/out points?

Solution

There are only two things that I can think of that clearly distinguish "Music" from all other Audio/sounds:

  1. Meter: Virtually all composed music has a meter. In theory this should be detectable with an FFT, but using the frequency range of approximately 0.25 Hz to 10 Hz (instead of the usual 20 Hz to 20 kHz). In practice? I don't know, but it seems worth a try.

  2. Tuning: Something common to almost all professional music, including the voices of professional singers (when they are musically accompanied), but not to any other sounds, is that it will all be in the same "tuning" of a 12-tone equal-tempered scale. In other words, the frequencies will always be separated by integer powers of 2^(1/12). Once the tuning is established, they will never fall in the gaps between these steps. Normal sounds, including human voices, are spread all over the spectrum, but music is almost always within ±10 cents of a scale note.

Method #1 is iffy; I don't know if anyone's ever tried it.
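
For what it's worth, a test for method #1 might look something like the minimal C# sketch below. It assumes mono PCM samples normalized to [-1, 1], analyzed in windows of roughly 10 to 30 seconds; the 50 Hz envelope rate, the 0.05 Hz scan step, and the 4x peak-over-average threshold are all illustrative guesses, not tuned values. A naive DFT is used instead of a library FFT only to keep the sketch self-contained; the envelope is short enough that the O(n²) cost doesn't matter.

    using System;

    static class MeterDetector
    {
        // Returns true when the energy envelope of a window of audio shows
        // a pronounced periodicity in the 0.25-10 Hz band (i.e. a "beat").
        public static bool LooksMetered(float[] samples, int sampleRate)
        {
            // 1. Collapse the waveform into an RMS energy envelope sampled
            //    at 50 Hz (one value per 20 ms hop).
            const double envRate = 50.0;
            int hop = (int)(sampleRate / envRate);
            int frames = samples.Length / hop;
            if (frames == 0) return false;
            var envelope = new double[frames];
            for (int i = 0; i < frames; i++)
            {
                double sum = 0;
                for (int j = 0; j < hop; j++)
                {
                    double s = samples[i * hop + j];
                    sum += s * s;
                }
                envelope[i] = Math.Sqrt(sum / hop);
            }

            // 2. Remove the DC component so it doesn't leak into the low bins.
            double dc = 0;
            for (int i = 0; i < frames; i++) dc += envelope[i];
            dc /= frames;
            for (int i = 0; i < frames; i++) envelope[i] -= dc;

            // 3. Evaluate a naive DFT of the envelope at 0.25-10 Hz only,
            //    in 0.05 Hz steps.
            double best = 0, mean = 0;
            int bins = 0;
            for (double f = 0.25; f <= 10.0; f += 0.05)
            {
                double re = 0, im = 0;
                for (int n = 0; n < frames; n++)
                {
                    double phase = 2 * Math.PI * f * n / envRate;
                    re += envelope[n] * Math.Cos(phase);
                    im -= envelope[n] * Math.Sin(phase);
                }
                double mag = Math.Sqrt(re * re + im * im);
                if (mag > best) best = mag;
                mean += mag;
                bins++;
            }
            mean /= bins;

            // 4. A single bin that towers over the band average suggests
            //    a regular beat. The 4x factor is an untuned guess.
            return best > 4.0 * mean;
        }
    }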

But #2 is definite: you can actually see this with an audio spectrum analyzer, though the FFT has to have very high frequency discrimination (at least 36 divisions per octave); a sketch of the cent-deviation test appears after the list below. But there are some catches, such as:

  • Differentiating between the music and other simultaneous sounds/noise
  • Stringed instruments, like guitars and violins, which often "bend" notes out of tune
  • Trombones and unaccompanied human voices, which can "slide" between notes, or use just temperament instead of equal temperament for chords
  • Programmatically establishing what the "tuning" is at different places in the film (it's not necessarily absolute, just stable within any one piece of music)
  • Harmonics: musical notes are usually more than simple sine waves, which means a lot of harmonic frequencies are mixed in. Harmonics aren't spaced exponentially like scale steps; they are integer multiples of the fundamental, so most of them don't line up with equal-tempered notes. Fortunately, harmonics are almost always lower in amplitude than the fundamentals, so it should be possible to just "look for the peaks".
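
Despite those catches, the core cent-deviation test itself is simple. Here is a minimal C# sketch of it, assuming the peak frequencies have already been extracted from a high-resolution FFT (at least 36 divisions per octave, as noted above) and that the tuning reference is A4 = 440 Hz; in practice the reference would have to be estimated for each stretch of film, and the 80% "in tune" vote threshold is an illustrative guess.

    using System;

    static class TuningDetector
    {
        // Assumed tuning reference; would need to be estimated per scene.
        const double A4 = 440.0;

        // Signed distance, in cents, from frequency f to the nearest
        // equal-tempered note (100 cents = one semitone = a factor of 2^(1/12)).
        public static double CentsFromNearestNote(double f)
        {
            double cents = 1200.0 * Math.Log(f / A4, 2.0);
            return cents - Math.Round(cents / 100.0) * 100.0;
        }

        // Counts the spectral peaks that land within +/- 10 cents of the
        // equal-tempered grid. A high fraction sustained over consecutive
        // frames is evidence of music.
        public static bool LooksTuned(double[] peakFrequencies)
        {
            if (peakFrequencies.Length == 0) return false;
            int inTune = 0;
            foreach (double f in peakFrequencies)
                if (Math.Abs(CentsFromNearestNote(f)) <= 10.0)
                    inTune++;
            return inTune >= 0.8 * peakFrequencies.Length;
        }
    }

Sliding either test over the soundtrack in overlapping windows and smoothing the resulting booleans would give the rough in/out points the OP is after.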

Well, those are my "clever" ideas. Now it's just a small matter of implementation ... ;-)

Other tips

You can use Microsoft Expression Encoder to work with video and audio.

The OP's problem can be summarized as follows:

In the generalized audio stream of a video, try to detect "music" versus "everything else".

Where "music" is not likely to exist in fingerprint databases.

And where "everything else" in this context must include:

  1. speech
  2. silence
  3. synthetic sounds
  4. Foley sounds (explosions, gunshots, footfalls, etc.)

We must also assume that the audio soundtrack of a generalized video is highly processed with echo, reverb, multichannel panning, etc.

In the general video case, all of the above audio elements would be mixed together into the final audio, making the problem domain absolutely immense.

This is a very challenging problem, with most likely no simple or robust solution.

In support of this premise: even a general music classifier (let's call it MuCLAS), where the unknown music sample is a member of the classifier's training set, is a very difficult problem, due to the significant expense involved in creating the training set and in building and tuning the classifier index.

But the OP's problem domain is much larger than the MuCLAS problem domain, due to the much higher entropy of the OP's unknown data set. This implies much higher complexity and cost, relative to MuCLAS.

Another supporting argument for the above premise is that the state of the art in general speech recognition assumes, and insists upon, much lower entropy in the unknown data set than the implied entropy of the OP's data set.

And speech recognition is one of the best funded problems in the general field of autonomous pattern recognition.
