Question

PyCUDA, for all its faults, usually has very good examples provided with it / downloadable from the wiki. But I couldn't find anything in the examples or in the documentation (or in a cursory Google search) demonstrating the PyCUDA way of dynamically allocating workloads to multiple devices.

Can anyone either give me a hint about what I should be doing or point me to examples?

One idea that popped into my head was using multiprocessing: generate a pool of N processes, each tied to one device, and then when the class is called (I have all my GPU functions in a separate class; probably not the best idea, but it works) it round-robins over the processes. How good / bad an idea is this?
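Roughly what I have in mind (a completely untested sketch; the array-doubling step is just a stand-in for my real GPU code, and the queue/worker names are made up):

```python
import multiprocessing as mp
import numpy as np

def gpu_worker(device_id, job_queue, result_queue):
    # Each worker process owns exactly one device and one context for its lifetime.
    import pycuda.driver as cuda
    import pycuda.gpuarray as gpuarray
    cuda.init()
    ctx = cuda.Device(device_id).make_context()
    try:
        while True:
            job = job_queue.get()
            if job is None:                            # sentinel: shut this worker down
                break
            idx, data = job
            out = (gpuarray.to_gpu(data) * 2).get()    # placeholder for real GPU work
            result_queue.put((idx, out))
    finally:
        ctx.pop()

if __name__ == "__main__":
    mp.set_start_method("spawn")                       # don't fork CUDA state into children
    import pycuda.driver as cuda
    cuda.init()
    n_gpus = cuda.Device.count()                       # 1 on the dev box, 4 on the test box
    job_queues = [mp.Queue() for _ in range(n_gpus)]
    results = mp.Queue()
    workers = [mp.Process(target=gpu_worker, args=(i, job_queues[i], results))
               for i in range(n_gpus)]
    for w in workers:
        w.start()
    jobs = [np.random.rand(1024).astype(np.float32) for _ in range(16)]
    for i, data in enumerate(jobs):                    # round-robin the jobs over the devices
        job_queues[i % n_gpus].put((i, data))
    outputs = [results.get() for _ in jobs]
    for q in job_queues:
        q.put(None)
    for w in workers:
        w.join()
```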

PS: My dev machine has 1 GPU and my test machine has 4 GPUs, so whatever solution I use needs to be able to deal with a dynamic number of devices (it also doesn't help that they have different compute capabilities, but that's life).


Solution

PyCUDA hasn't had any intrinsic multi-GPU support because CUDA itself hasn't had any. That will change in CUDA 4.0, whose API has been made thread safe and multi-GPU aware, but PyCUDA doesn't have that support yet, AFAIK. Even when it arrives, each device will have to be managed explicitly, with the workload divided up by you; there is no automatic workload distribution or anything like that.
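For concreteness, a minimal sketch of what "explicitly managed" means in PyCUDA: you create and pop a context on each device yourself and decide which slice of the data it gets. The even split and the missing kernel launch below are just placeholders:

```python
import numpy as np
import pycuda.driver as cuda

cuda.init()
ndev = cuda.Device.count()
data = np.arange(1_000_000, dtype=np.float32)
chunks = np.array_split(data, ndev)            # you divide the workload yourself

out_chunks = []
for i, chunk in enumerate(chunks):
    ctx = cuda.Device(i).make_context()        # explicitly select device i
    try:
        dbuf = cuda.mem_alloc(chunk.nbytes)
        cuda.memcpy_htod(dbuf, chunk)          # host -> device i
        # ... launch your kernel on device i here ...
        back = np.empty_like(chunk)
        cuda.memcpy_dtoh(back, dbuf)           # device i -> host
        out_chunks.append(back)
    finally:
        ctx.pop()                              # release the device before moving on

result = np.concatenate(out_chunks)
```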

For multi-GPU work I have normally used mpi4py. You could potentially use a multithreaded Python scheme, with each thread opening a separate context in PyCUDA. What works best will probably depend on how much communication is required between the devices.
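A rough sketch of the mpi4py route, launched as e.g. `mpiexec -n 4 python script.py`, with each rank binding to a device derived from its rank (the modulo handles more ranks than devices, and the doubling step stands in for whatever kernel you actually run):

```python
import numpy as np
from mpi4py import MPI
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

cuda.init()
ctx = cuda.Device(rank % cuda.Device.count()).make_context()
try:
    # Rank 0 splits the work and scatters one chunk to each rank.
    if rank == 0:
        data = np.arange(1_000_000, dtype=np.float32)
        chunks = np.array_split(data, size)
    else:
        chunks = None
    chunk = comm.scatter(chunks, root=0)

    partial = (gpuarray.to_gpu(chunk) * 2).get()   # placeholder GPU work on this rank's device

    # Gather the per-device results back on rank 0.
    pieces = comm.gather(partial, root=0)
    if rank == 0:
        result = np.concatenate(pieces)
finally:
    ctx.pop()
```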
