I would recommend the following approach:
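- Find the envelope of the signal in the time domain, e.g. as the magnitude of the analytic signal (see Hilbert transform).
- Smooth the envelope a bit, for example with a short moving average or a low-pass filter.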
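- Take the difference (diff) of the smoothed envelope and find its positive peaks to get the onsets of the tones.
- Use the onsets to pick frames, one frame starting at each onset, and find the spectrum of each frame using an FFT.
- Find the index of the maximum in each of the spectra and convert it to a frequency (index times sample rate divided by FFT length).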
The tricky part is getting a robust onset detector in step 4. A peak in the difference signal has to exceed a certain height in order to qualify as an onset. If your tones vary a lot in strength, a single threshold might pose a problem, but judging from your image of the time signal that doesn't seem to be the case.
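
For what it's worth, here is a rough sketch of the steps above in Python with NumPy and SciPy. The function name `detect_tone_frequencies` and the parameters `smooth_ms`, `frame_ms` and `rel_threshold` are my own placeholders, not something from your setup, and the defaults are guesses you would have to tune. The peak threshold is taken relative to the largest peak in the difference signal, which is only one possible choice and assumes the tones are of roughly similar strength.

```python
import numpy as np
from scipy.signal import hilbert, find_peaks


def detect_tone_frequencies(x, fs, smooth_ms=10.0, frame_ms=50.0, rel_threshold=0.3):
    """Return (onset sample indices, dominant frequency in Hz per onset).

    All parameter names and defaults are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)

    # Step 1: envelope = magnitude of the analytic signal (Hilbert transform).
    envelope = np.abs(hilbert(x))

    # Step 2: smooth the envelope with a simple moving average.
    win = max(1, int(fs * smooth_ms / 1000))
    envelope = np.convolve(envelope, np.ones(win) / win, mode="same")

    # Step 3: difference of the envelope; onsets show up as positive peaks.
    d = np.diff(envelope)

    # Step 4: keep only peaks above a fraction of the largest one, and at
    # least one frame length apart, so ripple is not counted as an onset.
    n = max(1, int(fs * frame_ms / 1000))
    onsets, _ = find_peaks(d, height=rel_threshold * d.max(), distance=n)

    # Step 5 and 6: FFT of a windowed frame at each onset, then convert the
    # index of the largest bin to a frequency: f = k * fs / n.
    window = np.hanning(n)
    freqs = []
    kept = []
    for onset in onsets:
        frame = x[onset:onset + n]
        if len(frame) < n:
            continue  # not enough samples left for a full frame
        spectrum = np.abs(np.fft.rfft(frame * window))
        k = int(np.argmax(spectrum))
        freqs.append(k * fs / n)
        kept.append(onset)
    return np.array(kept), np.array(freqs)
```

You would call it as `onsets, freqs = detect_tone_frequencies(x, fs)`. Note that the frequency resolution is `fs / n`, so a longer frame gives finer resolution, as long as the frame still fits inside one tone.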
Regards