Question

I am working on a Java program that launches a child process, receives data through its stdout, performs some calculation, and then repeats this cycle. I run this program on a supercomputer that uses a Torque-related PBS with a special scheduling feature that periodically suspends jobs in such a way as to maximise system utilisation.

One problem I had during execution was an instance where my child process mysteriously hung (cause currently unknown), causing Java to wait for a response that was never going to arrive. What I would like to do is monitor this process and enforce an execution time cutoff, i.e., if the process runs for an unusually long time, terminate and throw some kind of error letting me know that this happened.

Normally, I would use an Apache Commons Exec watchdog to do this. But I am worried that any time this job spends suspended will count towards the cutoff (assuming the watchdog uses the difference between System.currentTimeMillis() at start and at finish). Would an Apache Commons Exec watchdog suffer from this? Is there any way to exclude suspended time from the elapsed-time calculation?
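
To make the idea concrete, here is a rough sketch of the kind of suspend-aware cutoff I have in mind. It does not use Apache Commons Exec at all. It polls the child in short ticks and credits each tick with at most twice the tick length, so a long gap between two polls (which is what a suspension should look like from inside the JVM) adds almost nothing to the counted time. The class name `SuspendAwareTimer`, the method names and the "2 x tick" cap are all my own inventions, and the whole approach assumes that the scheduler suspends the JVM together with the child, which I have not verified.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: counts only the time during which this JVM was actually running.
final class SuspendAwareTimer {
    private final long maxCreditNanos; // upper bound on the time credited per tick
    private long lastSampleNanos;      // timestamp of the previous sample
    private long activeNanos = 0;      // accumulated "active" time

    SuspendAwareTimer(long tickMillis, long startNanos) {
        // Assumption: a gap longer than two ticks means the job was suspended.
        this.maxCreditNanos = 2 * TimeUnit.MILLISECONDS.toNanos(tickMillis);
        this.lastSampleNanos = startNanos;
    }

    // Records one sample and returns the active time so far.
    // Kept separate from the polling loop so it can be fed synthetic timestamps.
    long record(long nowNanos) {
        long gap = Math.max(nowNanos - lastSampleNanos, 0);
        lastSampleNanos = nowNanos;
        activeNanos += Math.min(gap, maxCreditNanos);
        return activeNanos;
    }

    // Waits for the child to exit. Returns true if it exited within the active-time
    // budget; otherwise kills it and returns false.
    static boolean waitForActive(Process child, long budgetMillis, long tickMillis)
            throws InterruptedException {
        SuspendAwareTimer timer = new SuspendAwareTimer(tickMillis, System.nanoTime());
        long budgetNanos = TimeUnit.MILLISECONDS.toNanos(budgetMillis);
        while (true) {
            if (child.waitFor(tickMillis, TimeUnit.MILLISECONDS)) {
                return true; // child finished on its own
            }
            if (timer.record(System.nanoTime()) >= budgetNanos) {
                child.destroyForcibly(); // cutoff reached: kill the hung child
                return false;
            }
        }
    }
}
```

The caller would then throw its own error when `waitForActive` returns false. One drawback I can already see: on a heavily loaded node a late tick would be under-counted in the same way as a suspension, so the effective cutoff could end up longer than the nominal budget.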

No correct solution

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow