Question

I've run into an odd issue with some OpenCL code that I'm working on where every once in a blue moon, Windows TDR will kick in and reset the GPU. The offending kernel runs for only 150ms and will run thousands of times (over the course of many hours) before the TDR kills it off, so I'm certain that the kernel itself isn't to blame.

My concern is that once the TDR kicks in, the kernel dies and the program is stuck in an eternal state of limbo. From what I can tell the call to clFinish never returns.

Is there a way to detect if a kernel has died off so that it can be handled gracefully?

Solution

I managed to come up with a solution, although it's far from optimal.

I've modified the program so that the OpenCL processing is done in a separate thread. I created a global watchdog variable shared between the parent thread and the processing thread. When the parent spawns the processing function as a thread, it sets the variable to the current time in milliseconds. When the processing thread finishes, it resets the watchdog variable to zero.

While the parent thread waits for the processing thread to finish, it keeps an eye on the watchdog variable. If the time elapsed since the variable was set exceeds a certain threshold, the program forcefully terminates itself without waiting for the processing thread to return.
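The watchdog described above might look roughly like the following. This is a minimal sketch, not my actual code: `g_watchdog_ms`, `now_ms` and `run_with_watchdog` are made-up names, and `work` stands in for the function that enqueues the kernel and calls clFinish.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Shared watchdog: 0 means "idle", any other value is the start time in ms.
std::atomic<std::int64_t> g_watchdog_ms{0};

// Current time in milliseconds from a monotonic clock.
std::int64_t now_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        steady_clock::now().time_since_epoch()).count();
}

// Runs `work` (hypothetical: the OpenCL processing function) on its own
// thread. Returns true if it finished, false if it ran longer than limit_ms.
template <typename Work>
bool run_with_watchdog(Work work, std::int64_t limit_ms) {
    g_watchdog_ms.store(now_ms());          // arm the watchdog
    std::thread worker([work]() mutable {
        work();                             // may never return after a TDR
        g_watchdog_ms.store(0);             // disarm on normal completion
    });
    worker.detach();                        // a hung thread cannot be joined

    for (;;) {
        const std::int64_t started = g_watchdog_ms.load();
        if (started == 0) return true;                     // worker finished
        if (now_ms() - started > limit_ms) return false;   // watchdog tripped
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```

When `run_with_watchdog` returns false, the caller should exit the process right away with a positive return code rather than try to carry on, since the detached thread is still stuck inside the driver call.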

This solution works with or without Windows TDR enabled. If TDR is enabled and the driver resets, the call to clFinish() will never return and the parent will terminate once the watchdog timer trips. If TDR is disabled, the runaway kernel will freeze the display, but once the watchdog timer trips, the parent will terminate processing, ending the freeze.

Now that I have a watchdog set up, I simply wrapped my program in a script: if the program terminates in error (positive return code), the script reruns it.

OTHER TIPS

Ideally, you should get an error code from clFinish, or from clWaitForEvents called with the OpenCL event object generated when enqueuing the kernel. Since TDR resets the graphics driver, I don't think any OpenCL implementation will keep working reliably after the reset, meaning there is no recovery route.
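If the blocking calls hang instead of returning an error, one alternative is to poll the event's execution status with a deadline. The sketch below is only an illustration under assumptions: `wait_for_status` and `WaitResult` are made-up names, the status query is passed in as a callable so the code compiles without the OpenCL headers, and whether the driver actually reports a negative status after a TDR depends on the implementation.

```cpp
#include <chrono>
#include <thread>

enum class WaitResult { Completed, Failed, TimedOut };

// Polls `poll_status` until it reports completion, an error, or the timeout.
// The callable is expected to follow OpenCL's convention for event status:
// 0 (CL_COMPLETE) when done, a negative value on abnormal termination,
// and a positive value (queued, submitted, running) while still pending.
template <typename Poll>
WaitResult wait_for_status(Poll poll_status, std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const int status = poll_status();
        if (status == 0) return WaitResult::Completed;
        if (status < 0) return WaitResult::Failed;
        if (std::chrono::steady_clock::now() >= deadline)
            return WaitResult::TimedOut;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}

// With OpenCL, the callable could be written along these lines:
//
//   auto poll = [event] {
//       cl_int s = CL_QUEUED;
//       cl_int err = clGetEventInfo(event, CL_EVENT_COMMAND_EXECUTION_STATUS,
//                                   sizeof(s), &s, nullptr);
//       return err != CL_SUCCESS ? static_cast<int>(err) : static_cast<int>(s);
//   };
```

Call clFlush on the queue before polling, because querying an event does not by itself submit the enqueued commands to the device.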

I would rather disable TDR completely. It is only worthwhile when you are debugging code that can get stuck in an infinite loop and keep the GPU permanently busy.

If you want to keep TDR but can change the code, then calling some sort of thread sleep function to pause your code for a few milliseconds between kernel launches could also alleviate this problem, at the expense of processing speed. This gives the graphics card a chance to respond to display rendering commands so that TDR is not triggered.
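As a rough sketch of that idea, assuming the kernel is launched repeatedly from a host-side loop: `run_throttled` is a made-up helper, and `launch_once` is a placeholder for whatever enqueues one kernel run and waits for it.

```cpp
#include <chrono>
#include <thread>

// Calls `launch_once(i)` for i = 0 .. iterations-1 and sleeps for `pause`
// between consecutive launches so the GPU can service display work.
template <typename Launch>
void run_throttled(int iterations, Launch launch_once,
                   std::chrono::milliseconds pause) {
    for (int i = 0; i < iterations; ++i) {
        launch_once(i);                        // hypothetical: enqueue + wait
        if (i + 1 < iterations)
            std::this_thread::sleep_for(pause);  // idle gap for the display
    }
}
```

The right pause length is something to find by experiment; a few milliseconds per launch is the starting point suggested above.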

Licensed under: CC-BY-SA with attribution