Question

I'm currently working on a rMBP running OS X 10.9 and OpenCV GPU/CUDA code, more specifically the BruteForceMatcher_GPU knnMatchSingle/knnMatch functionality. After about 5 seconds of GPU computation, OS X steps in and terminates the program; this is a fairly well known recovery mechanism (triggered by OS X's launchd or the NVIDIA GPU driver) to avoid screen freezes. Windows and Linux allow turning the GPU watchdog timer off, but OS X doesn't, and it may make sense why. Anyway, my question is: is it possible in OpenCV to save the GPU memory state and restore it after a certain delay, thereby overcoming the watchdog timer's limitations? If not, are there any other ideas on how to work around OS X's GPU watchdog timer? Many thanks.

PS: I've installed gfxCardStatus 2.3 to be able to see the switch between graphics cards.


Solution

Having worked with both the OpenCV GPU module and low-level CUDA programming, I have run into this problem as well. The short answer is no: you cannot bypass the watchdog timer the way you can via the registry keys on Windows. Or rather, I never found a way of doing it, even though I tried several suggestions on various CUDA developer forums.

Due to the architecture of NVIDIA GPUs it is not possible to save a GPU state as such. Generally, to compute anything on the GPU you initialize your data on the CPU and keep it in RAM, copy the data to the GPU's global memory where the GPU cores can access it, do your computations, save the result in global memory and copy it back to the CPU/RAM where the CPU can access it, and your kernel terminates, releasing all data. When the watchdog timer kicks in, the kernel is terminated and all your data is lost.
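To make that round trip concrete, here is a minimal host-side sketch of the usual CUDA pipeline (the kernel and variable names are placeholders of mine, not from the original post). If the watchdog kills the kernel partway through, the device allocations and any partial results sitting in global memory are simply gone:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Trivial placeholder kernel: doubles every element in place.
    __global__ void doubleElements(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float* hostData = new float[n];              // 1. initialize data in CPU RAM
        for (int i = 0; i < n; ++i) hostData[i] = (float)i;

        float* devData = 0;
        cudaMalloc(&devData, n * sizeof(float));     // 2. allocate GPU global memory
        cudaMemcpy(devData, hostData, n * sizeof(float),
                   cudaMemcpyHostToDevice);          // 3. copy input to the GPU

        doubleElements<<<(n + 255) / 256, 256>>>(devData, n);  // 4. compute
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess)                      // a watchdog kill surfaces here
            printf("kernel failed: %s\n", cudaGetErrorString(err));

        cudaMemcpy(hostData, devData, n * sizeof(float),
                   cudaMemcpyDeviceToHost);          // 5. copy the result back
        cudaFree(devData);                           // 6. release GPU memory
        delete[] hostData;
        return 0;
    }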

So technically there are only two possible ways to work around this issue. The first workaround is to do only GPU computations that take less than the 5-second limit (or whatever the timer is on your system), save the intermediate result to the CPU/RAM, and start a new kernel with the next batch of data waiting in a queue. You keep doing this until you are done (see the sketch below). However, this has a large impact on performance, as you first have to split up your data, queue it properly, and copy data to and from the GPU several times, so you may lose a lot of performance depending on your data.
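As an illustration of that first workaround, here is a rough, untested sketch of batching knnMatch over chunks of the query descriptors. It assumes the OpenCV 2.4 cv::gpu::BFMatcher_GPU class (older 2.4.x releases name it BruteForceMatcher_GPU, as in the question); the helper name batchedKnnMatch and the chunkRows parameter are hypothetical, and the chunk size has to be tuned so each call finishes well under the watchdog limit:

    #include <opencv2/core/core.hpp>
    #include <opencv2/gpu/gpu.hpp>
    #include <algorithm>
    #include <vector>

    // Match 'queryDescriptors' against 'trainDescriptors' in chunks small
    // enough that each GPU call stays safely below the ~5 second watchdog.
    std::vector< std::vector<cv::DMatch> > batchedKnnMatch(
            const cv::Mat& queryDescriptors,
            const cv::Mat& trainDescriptors,
            int k, int chunkRows)
    {
        cv::gpu::BFMatcher_GPU matcher(cv::NORM_L2);
        cv::gpu::GpuMat trainGpu(trainDescriptors);      // upload the train set once

        std::vector< std::vector<cv::DMatch> > allMatches;
        for (int start = 0; start < queryDescriptors.rows; start += chunkRows)
        {
            int end = std::min(start + chunkRows, queryDescriptors.rows);
            cv::gpu::GpuMat queryGpu(queryDescriptors.rowRange(start, end));

            std::vector< std::vector<cv::DMatch> > chunkMatches;
            matcher.knnMatch(queryGpu, trainGpu, chunkMatches, k);

            // queryIdx is relative to the chunk, so shift it back to the
            // index in the full query descriptor matrix.
            for (size_t i = 0; i < chunkMatches.size(); ++i)
                for (size_t j = 0; j < chunkMatches[i].size(); ++j)
                    chunkMatches[i][j].queryIdx += start;

            allMatches.insert(allMatches.end(),
                              chunkMatches.begin(), chunkMatches.end());
        }
        return allMatches;
    }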

The other solution is to have two dedicated GPUs installed: one that works as the system GPU, and another that just sits there crunching numbers when you tell it to. At least on Windows and Linux this works flawlessly without having to disable the watchdog timer. I do not know if the same holds for OS X, as I have no experience with multiple CUDA GPUs on a Mac. CUDA exposes a function, cudaSetDevice, where you can manually set the device to be used:

http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__DEVICE_g418c299b069c4803bfb7cab4943da383.html

The default GPU is always index 0 and is, in my experience, the one your system uses as the current display device. So setting the index to 1 will use the GPU that is currently not used by your system (note that I am not sure the behavior is the same in an SLI setup). For example, the Windows machine I used for testing had an 8800GT as the display device and a TESLA C2075 on the side. Both supported CUDA, so manually setting the TESLA as the CUDA device (index 1) meant that the display device never froze, and therefore the watchdog never kicked in. The same happened on my Linux machine with a GTX680/TESLA K20c combo.
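For reference, here is a small sketch of how one might enumerate the CUDA devices and pick the compute card instead of hard-coding index 1. Preferring the first device whose kernel run-time limit is disabled is my own heuristic, not something from the original answer:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        // List every CUDA-capable device and pick one for number crunching.
        int chosen = 0;
        for (int i = 0; i < deviceCount; ++i)
        {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s (run time limit enabled: %d)\n",
                   i, prop.name, prop.kernelExecTimeoutEnabled);

            // A device without a run time limit typically has no display
            // attached, so the watchdog does not apply to it.
            if (!prop.kernelExecTimeoutEnabled)
                chosen = i;
        }

        cudaSetDevice(chosen);   // all subsequent CUDA calls use this device
        return 0;
    }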

It is worth noting that cudaSetDevice only knows about CUDA devices, so if you have an integrated GPU or an AMD GPU alongside your NVIDIA card, you cannot switch between them with cudaSetDevice. It will ALWAYS use your CUDA-enabled device, or fail altogether. As far as I know there is no cv::gpu::cudaSetDevice, so I do not know whether it is possible for you to call this function together with your OpenCV code. If you can compile your code with the NVCC compiler (or link against the CUDA runtime), you might be able to call some native CUDA functions (like cudaSetDevice) before your OpenCV functions.
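If mixing the two is an option, a minimal sketch could look like the following. It assumes the translation unit is compiled with NVCC or linked against the CUDA runtime, and that cudaSetDevice is called before the first OpenCV GPU call initializes a context; the note about cv::gpu::setDevice is something to verify against your particular 2.4.x build:

    #include <cuda_runtime.h>        // raw CUDA runtime API (link against cudart)
    #include <opencv2/gpu/gpu.hpp>

    int main()
    {
        // Pick the compute card (index 1 in the dual-GPU examples above)
        // before the first OpenCV GPU call touches the device.
        if (cudaSetDevice(1) != cudaSuccess)
            return 1;

        // Depending on the OpenCV 2.4.x release, cv::gpu::setDevice(1) may be
        // available as well and would avoid the raw runtime API entirely.

        // ... normal OpenCV GPU code follows: upload descriptors to
        // cv::gpu::GpuMat, create the matcher, call knnMatch, etc. ...
        return 0;
    }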

However, with OpenCV you have much less control over what happens in the CUDA code (compared to writing your own kernels), and it might not be possible to split up your data and still get a satisfactory result. In that case I do not think there is a solution to your problem. On top of this, OS X likes switching between multiple GPUs according to the current workload on the MacBook Pro.

Back when I had this problem on my MacBook Pro, I installed Windows 7 in Boot Camp along with VS2010 and the CUDA Toolkit, disabled the watchdog timer, and ran my code without issues. It is not a perfect solution, but at least it allowed me to develop my CUDA code locally before deploying it to my test server.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow