MPI+CUDA mixed programming - Driver error

https://stackoverflow.com/questions/20695367

19-09-2022
Question

I'm using MPI+CUDA mixed mode to program a GPU cluster for matrix multiplication. When I offload the multiplication operations to the GPUs via MPI and CUDA, it gives an error message at run time:

FATAL: Error inserting nvidia (/lib/modules/3.2.0-23-generic-pae/kernel/drivers/video/nvidia.ko): No such device

MPI is used to transfer the data blocks, and upon receiving the data, a generic C function is called that launches a CUDA kernel. The test setup has 3 machines, each with a single GPU. I also tested a CUDA-only local version. I didn't get any error messages, but the results of the algorithms were wrong (even for small, simple algorithms).

What's the reason for this error? Please note that it happens only when I try to use MPI with CUDA; the CUDA-only version works well. Thanks in advance.

Solution

The error occurs because the Nouveau driver, not the NVIDIA driver, is controlling the GPU. Nouveau should therefore be blacklisted before the NVIDIA driver and CUDA toolkit are installed.

sudo nano /etc/modprobe.d/blacklist.conf

Add the line "blacklist nouveau" at the end of the file.

If the NVIDIA driver is already installed, then re-install the NVIDIA driver.
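
To confirm which driver currently holds the GPU on each node, something like the following sketch may help. It assumes lsmod is available on the machine; the function name gpu_driver_from_lsmod is hypothetical and not part of the original answer. It only inspects the list of loaded kernel modules and does not change anything.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): reads lsmod-style
# text on stdin and prints which GPU kernel module is loaded.
# Prints "nouveau" if nouveau is loaded (even when nvidia is also listed),
# "nvidia" if only nvidia is loaded, and "none" otherwise.
gpu_driver_from_lsmod() {
    awk '
        # Skip the header line; the first column is the module name.
        NR > 1 && $1 == "nouveau" { nouveau = 1 }
        NR > 1 && $1 == "nvidia"  { nvidia = 1 }
        END {
            if (nouveau)     print "nouveau"
            else if (nvidia) print "nvidia"
            else             print "none"
        }
    '
}

# Usage on each machine in the cluster:
#   lsmod | gpu_driver_from_lsmod
```

If this prints "nouveau" on any node, the blacklist step above has presumably not taken effect there yet.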

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow