After some time looking through MSDN, I found the solution.
Using WASAPI loopback recording, you can capture whatever is being sent to the audio output device:
http://msdn.microsoft.com/en-gb/library/windows/desktop/dd316551(v=vs.85).aspx
That page also links to an example of how to capture the stream:
http://msdn.microsoft.com/en-gb/library/windows/desktop/dd370800(v=vs.85).aspx
There, as the example shows, you can get the buffer data by calling:
pCaptureClient->GetBuffer(...)
All you have to do then is check the value of those bytes. If they are all 0s, then nothing is playing.
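
The byte check described above could look roughly like the sketch below. The helper name `IsPacketSilent` and the constant `kBufferFlagSilent` are made up for illustration; the constant mirrors the value of `AUDCLNT_BUFFERFLAGS_SILENT` from `<audioclient.h>` so that the snippet compiles without the Windows SDK. It assumes a format where silence is encoded as zero bytes (float or signed 16-bit PCM); 8-bit unsigned PCM would need a different check.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Mirrors AUDCLNT_BUFFERFLAGS_SILENT from <audioclient.h>; redefined here
// only so this sketch builds without the Windows SDK.
constexpr std::uint32_t kBufferFlagSilent = 0x2;

// Hypothetical helper: returns true when a captured packet carries no audio.
// `data` and `flags` are the values GetBuffer hands back; `byteCount` is
// assumed to be the number of frames read multiplied by the format's nBlockAlign.
bool IsPacketSilent(const std::uint8_t* data, std::size_t byteCount,
                    std::uint32_t flags)
{
    // The audio engine may already have marked the packet as silence.
    if (flags & kBufferFlagSilent) {
        return true;
    }
    // An empty packet has nothing to play either.
    if (data == nullptr || byteCount == 0) {
        return true;
    }
    // Otherwise the packet is silent only if every byte is zero.
    return std::all_of(data, data + byteCount,
                       [](std::uint8_t b) { return b == 0; });
}
```

In the capture loop from the MSDN example, you would call this right after `pCaptureClient->GetBuffer(...)` succeeds and before `ReleaseBuffer`.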
For the speech synthesis, I used the .NET SpeechSynthesizer class:
http://msdn.microsoft.com/en-us/library/system.speech.synthesis.speechsynthesizer.aspx