Detecting and recovering from Windows TDR?

Question 1

I managed to come up with a solution, although it's far from optimal.

I've modified the program so that the OpenCL processing is done in a separate thread. I created a global shared watchdog variable between the parent and child process. When the parent spawns the processing function as a thread, it sets the variable to the current time in milliseconds. When the processing thread finishes, it reset the watchdog variable to zero.

While the parent thread waits for the processing thread to finish, it keeps an eye on the watchdog timer. If the timer exceeds a certain threshold then the program forcefully terminates itself without waiting for the child process to return.

This solution works with or without Windows TDR set. If TDR is set and the driver resets, the call to clFinish() will never return and the parent will terminate once the watchdog timer trips. If TDR is not set, the runaway process will freeze the display, but once the watchdog timer trips, the parent will terminate processing, ending the freeze.

Now that I have a watchdog set up, I simply wrapped my program in a script: if it terminated in error (positive return code) then the program is rerun.

Question 2

Ideally, you should get an error code from clFinish or clWaitForEvents with the OpenCL event object generated when enqueuing the kernel. Since TDR resets the graphics driver, I don't think any OpenCL implementation will work reliably, meaning there is no recovery route.

Rather disable TDR completely. It is only worthwhile when you debug code that gets stuck in an infinite loop that permanently keeps the GPU busy.

If you want to keep TDR but can change the code then using some sort of thread sleep function to delay your code for a few milliseconds could also alleviate this problem, at the expense of sacrificing processing speed. This gives the graphics card a chance to respond to display rendering commands so that TDR is not triggered.