This isn't easy for anything except a trivial signal. Almost all Western 'classical' and commercial music features coincident drum sounds, that is, drums sounding at the same time as one another or as other instruments.
1: Superposition: The original sources add together in the frequency domain much as they do in the time domain. Each FFT bin contains contributions from all instruments currently being played (and those which are undamped and still decaying, or resonating sympathetically). Unpicking the various sources is hard, and certainly not a matter of simply comparing against a library of spectra.
2: The FFT by its definition operates on a windowed block of audio in the time domain and yields the magnitude and phase of the basis function in each bin over that window period. The best you could say is that content appeared in the bins corresponding to a drum sound at some point within the window period. If you were to compute a 1024-point FFT, the window duration would be 23 ms at 44.1 kHz. To put this into a musical perspective, 16th notes at 120 bpm are 125 ms apart, and 64th notes are 31.3 ms apart. You might get away with smaller FFTs, which give finer time resolution at the cost of coarser frequency resolution.
3: Percussion instrument signals tend to look a lot like noise - at least at the point where the instrument is hit. That is to say, there will be energy spread across the spectrum and no obviously dominant frequencies. After impact, tuned percussion starts to look more 'tonal'.
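The timing figures in point 2 can be checked with a few lines of arithmetic. This is only a sketch: the helper names (`window_ms`, `bin_width_hz`, `note_interval_ms`) are made up for illustration, and it assumes the beat is a quarter note, so four 16th notes fall in each beat.

```python
def window_ms(n_fft, sample_rate):
    """Duration of an n_fft-sample analysis window, in milliseconds."""
    return 1000.0 * n_fft / sample_rate


def bin_width_hz(n_fft, sample_rate):
    """Spacing between adjacent FFT bins, in Hz."""
    return sample_rate / n_fft


def note_interval_ms(bpm, notes_per_beat):
    """Time between successive notes, in milliseconds.

    Assumes a quarter-note beat: notes_per_beat=4 gives 16th notes,
    notes_per_beat=16 gives 64th notes.
    """
    return 60000.0 / bpm / notes_per_beat


SAMPLE_RATE = 44100  # Hz, as in the text

# Shorter windows localise events better in time but blur them in frequency.
for n_fft in (256, 512, 1024, 2048):
    print(f"{n_fft:5d} points: "
          f"{window_ms(n_fft, SAMPLE_RATE):5.1f} ms window, "
          f"{bin_width_hz(n_fft, SAMPLE_RATE):6.1f} Hz per bin")

print(f"16th notes at 120 bpm: {note_interval_ms(120, 4):.2f} ms apart")
print(f"64th notes at 120 bpm: {note_interval_ms(120, 16):.2f} ms apart")
```

The loop makes the trade-off visible: halving the FFT size halves the window duration but doubles the width of each bin, so low drums such as a kick become harder to tell apart by frequency alone.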
You probably need to look at a time-domain approach to accurately detect the onset point (onset detection). From there you could look at time or frequency domain characteristics of the signal to try and deduce the instrument in question. There's probably also a lot you could do with a priori knowledge of the genre of music being played, allowing you to predict the patterns that are likely to be present.
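As a rough illustration of what a time-domain onset detector might look like, here is a minimal sketch that flags frames whose short-term energy jumps sharply relative to the previous frame. It is not a production method: the frame length, hop, ratio threshold, energy floor and minimum gap are all assumed values chosen for this toy example, and the test signal is synthetic (two decaying noise bursts standing in for drum hits), not real audio.

```python
import math
import random


def frame_energies(samples, frame_len, hop):
    """Mean squared amplitude of each overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_onsets(samples, frame_len=256, hop=128, ratio=4.0,
                  floor=1e-4, min_gap=2048):
    """Return start indices of frames where energy rises abruptly.

    A frame is flagged when its energy exceeds `ratio` times the previous
    frame's energy (clamped below by `floor` so near-silence cannot
    trigger), and it is at least `min_gap` samples after the last onset.
    All thresholds are illustrative assumptions.
    """
    energies = frame_energies(samples, frame_len, hop)
    onsets = []
    for i in range(1, len(energies)):
        reference = max(energies[i - 1], floor)
        position = i * hop
        far_enough = not onsets or position - onsets[-1] >= min_gap
        if energies[i] > ratio * reference and far_enough:
            onsets.append(position)
    return onsets


def synthetic_hits(hit_positions, length=20000, seed=0):
    """Low-level background noise plus a decaying noise burst per hit."""
    rng = random.Random(seed)
    signal = [0.001 * rng.uniform(-1.0, 1.0) for _ in range(length)]
    for pos in hit_positions:
        for k in range(4000):
            if pos + k < length:
                envelope = math.exp(-k / 1000.0)
                signal[pos + k] += envelope * rng.uniform(-1.0, 1.0)
    return signal


TRUE_HITS = [4000, 12000]  # sample indices of the synthetic hits
onsets = detect_onsets(synthetic_hits(TRUE_HITS))
print(onsets)
```

Each detected onset is reported as the start of the frame in which the jump was seen, so it can only be accurate to within about one frame length. In real music, with overlapping instruments and sustained sounds, a plain energy threshold like this will miss hits and raise false alarms, which is why the subsequent classification step and any prior knowledge of the rhythm matter.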
This is a particular case of the more general audio source separation problem. There's been lots of academic activity in this area, and consequently a lot of published papers describing approaches. Look for source separation, music information retrieval, and audio feature detection.