Question

Does anyone know of a way to detect whether a remote app has failed or crashed? I mean when it becomes unusable (you would usually see "Not Responding" in the title bar in this case) while the app is still running, so just checking that the process is no longer running is not enough.

System.Diagnostics.Process.Responding is not supported for a process on a remote machine, and there seem to be no other properties I can query in WMI's Win32_Process for this kind of information.

Solution

In determining the 'liveness' of a program, it is important to measure the aspect that defines it as being alive in a useful manner.

Several 'proxy' approaches are superficially appealing due to their simplicity but fundamentally do not measure the important aspect.

Perhaps the most common are "is the process alive?" and a separate heartbeat broadcast thread, probably because they are so simple to do:

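// requires: using System.Threading;
bool keepSending = true; // set this to false to shut down the thread
var hb = new Thread(() =>
    {
         // Volatile.Read makes sure a change made by another thread is seen
         while (Volatile.Read(ref keepSending))
         {
             SendHeartbeatMessage();
             Thread.Sleep(1000); // pace the heartbeats instead of spinning
         }
    });
hb.Start(); // Start() returns void, so it cannot be chained into the assignment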

Both of these, however, have a serious flaw: if the real working thread(s) in your app lock up (say by going into an infinite loop or a deadlock), the heartbeat thread will continue to merrily send out OK messages. With process-based monitoring you will likewise continue to see the process 'alive' despite it no longer performing its real task.
You can improve the thread approach in many ways (significantly increasing the complexity and the chance of threading issues) by layering on tests for progress on the main thread, but this takes the wrong solution and tries to push it towards the right one.

What is best is to make the task(s) performed by the program part of the liveness check: perhaps heartbeat directly from the main thread after every subtask is done (with a threshold to ensure that it does not happen too often), or simply look at the output (if it exists) and ensure that the inputs are resulting in outputs.
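A minimal sketch of heartbeating from the working thread itself, with a threshold so it does not fire too often. The class name ThrottledHeartbeat and the injectable clock are assumptions made for illustration; the send delegate stands in for something like SendHeartbeatMessage. It is meant to be called from a single thread and is not thread-safe.

```csharp
using System;

// Hypothetical helper: emits a heartbeat from the thread doing the real work,
// at most once per minInterval.
public sealed class ThrottledHeartbeat
{
    private readonly TimeSpan _minInterval;
    private readonly Action _send;           // e.g. SendHeartbeatMessage
    private readonly Func<DateTime> _clock;  // injectable so it can be tested
    private DateTime? _lastSent;

    public ThrottledHeartbeat(TimeSpan minInterval, Action send, Func<DateTime> clock = null)
    {
        _minInterval = minInterval;
        _send = send;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    // Call after every completed subtask. Returns true if a heartbeat was sent.
    public bool TaskCompleted()
    {
        DateTime now = _clock();
        if (_lastSent.HasValue && now - _lastSent.Value < _minInterval)
            return false; // too soon since the last heartbeat

        _send();
        _lastSent = now;
        return true;
    }
}
```

Because the heartbeat is only sent when a subtask actually finishes, a deadlocked or spinning main thread stops producing heartbeats, which is exactly what a monitor needs to see.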

It is better still to validate this both internally (within the program) and externally (especially if there are external consumers/users of the program). If you have a web server, attempt to use it; if your app is some event-loop-based system, trigger events to which it must respond (and verify that the output is correct). Whatever is done, always keep in mind that you wish to verify that useful and correct behaviour is occurring, rather than just any activity at all.
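As a rough sketch of such an external check, the helper below runs a request against the app, waits up to a deadline, and passes only if the expected output comes back. LivenessProbe and Check are hypothetical names, and the request delegate is a placeholder for whatever "using the app" means in your case (an HTTP call, a triggered event, and so on).

```csharp
using System;
using System.Threading.Tasks;

public static class LivenessProbe
{
    // Returns true only if request() finishes within the timeout
    // AND produces the expected output. A hang, an exception, or a
    // wrong answer all count as "not alive in a useful manner".
    public static bool Check(Func<string> request, string expected, TimeSpan timeout)
    {
        Task<string> task = Task.Run(request);
        try
        {
            return task.Wait(timeout) && task.Result == expected;
        }
        catch (AggregateException)
        {
            return false; // the request itself failed
        }
    }
}
```

Note that a request which times out keeps running on its thread-pool thread; the sketch only stops waiting for it.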

The more you verify not only the existence of the program but also its actions, the more useful your check will be. You will check more of the system the further you put yourself from the internal state: if you run your monitor process on the same box you may only check local loopback, whereas running it off the box validates much more of the network stack, including often forgotten aspects like DNS.

Inevitably this makes the checking harder to do, because you are inherently thinking about a specific task rather than a general solution. Even so, the benefits are usually large enough that this approach should be seriously considered in many cases.

OTHER TIPS

It is hard to know if an app has crashed or is actually doing something useful.

Consider this:

 while(true);

The processor is (very) busy, and the app might even respond if this is done in a separate thread. However, this is really unwanted behaviour, since the app is not doing its work anymore.

The best way to tackle this is to increment certain counters at certain points in the software and broadcast them periodically. A watchdog app can listen for these broadcasts, and if they stop arriving or no longer make sense (the counter does not add up), you can kill the process and restart it.
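The watchdog side of this could look roughly like the sketch below. CounterWatchdog is a hypothetical name, and "makes sense" is simplified here to "the counter moved forward"; how the broadcasts arrive and what to do on a stall (kill, restart, alert) is left to the caller.

```csharp
using System;

public sealed class CounterWatchdog
{
    private readonly TimeSpan _timeout;
    private long _lastCounter = -1;
    private DateTime _lastProgress;

    public CounterWatchdog(TimeSpan timeout, DateTime startedAt)
    {
        _timeout = timeout;
        _lastProgress = startedAt;
    }

    // Record a received broadcast. Only a counter that moved forward
    // counts as progress; a repeated or lower value is ignored.
    public void Report(long counter, DateTime receivedAt)
    {
        if (counter > _lastCounter)
        {
            _lastCounter = counter;
            _lastProgress = receivedAt;
        }
    }

    // True when no progress has been seen for longer than the timeout.
    public bool IsStalled(DateTime now)
    {
        return now - _lastProgress > _timeout;
    }
}
```

An app stuck in a loop that keeps re-sending the same counter value is treated the same as one that sends nothing at all.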

Broadcasting can be done in multiple ways. The easiest is to just write the counters to a file (make sure the file is locked while you write to it, so that a process reading it at the exact same time doesn't get a half-mangled file).
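One possible way to do the locked write, sketched with FileShare.None so that no other handle can open the file while it is being written. CounterFile and its method names are made up for this example, and the sketch assumes Windows-style sharing semantics where a concurrent open fails with an IOException.

```csharp
using System;
using System.IO;

public static class CounterFile
{
    public static void Write(string path, long counter)
    {
        // FileShare.None: readers cannot open the file until we are done.
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
        using (var writer = new StreamWriter(fs))
        {
            writer.Write(counter);
        }
    }

    // Returns null if the file is missing, currently locked, or unparsable.
    public static long? TryRead(string path)
    {
        try
        {
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
            using (var reader = new StreamReader(fs))
            {
                long value;
                return long.TryParse(reader.ReadToEnd(), out value) ? value : (long?)null;
            }
        }
        catch (IOException)
        {
            return null; // missing or locked: the caller retries on its next poll
        }
    }
}
```

The reader treats a locked file as "no data this time" rather than an error, since it will simply poll again.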

More advanced ways are to use named pipes or a socket. A UDP socket is very easy to set up and use in this case. Don't worry about packet loss, since on a local network this almost never happens.
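A small sketch of the UDP variant using UdpClient. The UdpCounter class is hypothetical, and the payload format (the raw bytes of a long) is an assumption that requires both ends to share the same byte order; a real setup might prefer a text format.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class UdpCounter
{
    // Sender side: fire-and-forget datagram carrying the counter.
    public static void Send(string host, int port, long counter)
    {
        using (var client = new UdpClient())
        {
            byte[] payload = BitConverter.GetBytes(counter); // host byte order
            client.Send(payload, payload.Length, host, port);
        }
    }

    // Watchdog side: returns null if nothing valid arrives within timeoutMs.
    public static long? Receive(UdpClient listener, int timeoutMs)
    {
        listener.Client.ReceiveTimeout = timeoutMs;
        try
        {
            var remote = new IPEndPoint(IPAddress.Any, 0);
            byte[] data = listener.Receive(ref remote);
            return data.Length == sizeof(long) ? BitConverter.ToInt64(data, 0) : (long?)null;
        }
        catch (SocketException)
        {
            return null; // timed out or socket error: treat as a missed heartbeat
        }
    }
}
```

Since an occasional datagram can still be lost, the watchdog should only react after several missed heartbeats, not after a single one.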

You can use a polling mechanism and periodically ask the remote application for its status.

Licensed under: CC-BY-SA with attribution