I'm using MPI+CUDA mixed mode to program a GPU cluster for matrix multiplication. When I offload the multiplication operations to the GPUs via MPI and CUDA, it gives an error message at run time:

FATAL: Error inserting nvidia (/lib/modules/3.2.0-23-generic-pae/kernel/drivers/video/nvidia.ko): No such device

MPI transfers the data blocks, and on receiving the data each node calls a generic C function that launches a CUDA kernel. The test setup has 3 machines, each with a single GPU. I also tested a CUDA-only local version. It did not produce any error messages, but the results of the algorithms were wrong (even for small, simple algorithms).

What's the reason for this error? Please note that it only appears when I try to use MPI together with CUDA. The CUDA-only version runs without the error. Thanks in advance.


Solution

The errors occur because the open-source Nouveau driver is controlling the GPU instead of the NVIDIA driver. Nouveau should therefore be blacklisted before installing the NVIDIA driver and the CUDA toolkit.
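To confirm which kernel driver is actually bound to the GPU on each node, the output of lspci -k can be inspected. The following is only a minimal sketch: the helper name gpu_driver_in_use is hypothetical, and it assumes the usual lspci -k layout in which a device header line is followed by an indented "Kernel driver in use:" line.

```shell
# Hypothetical helper: print the kernel driver bound to each NVIDIA device.
# It reads `lspci -k` text on stdin, so it is used as:
#   lspci -k | gpu_driver_in_use
gpu_driver_in_use() {
    awk '
        # A device header line starts with a hex bus address (not a tab).
        /^[0-9a-f]/ { is_nvidia = (tolower($0) ~ /nvidia/) }
        # Inside an NVIDIA device block, keep only the driver name.
        is_nvidia && /Kernel driver in use:/ {
            sub(/.*Kernel driver in use: */, "")
            print
        }
    '
}
```

If this prints nouveau rather than nvidia on a node, that node matches the situation described in this answer.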

sudo nano /etc/modprobe.d/blacklist.conf

Add the line "blacklist nouveau" at the end of the file, then reboot so that the change takes effect.
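The edit can also be done without an interactive editor. The sketch below is an illustration, not the only way to do it: the function name blacklist_module is hypothetical, and the line it writes is "blacklist nouveau", which is the form modprobe configuration files expect. It takes the file path as a parameter so it can be tried on a scratch file first.

```shell
# Hypothetical helper: append "blacklist <module>" to a modprobe config file
# unless that exact line is already present.
#   $1 = module name, $2 = path to the config file
blacklist_module() {
    module="$1"
    conf="$2"
    # grep -x matches whole lines only, -q suppresses output.
    if ! grep -qx "blacklist $module" "$conf" 2>/dev/null; then
        printf 'blacklist %s\n' "$module" >> "$conf"
    fi
}

# Dry run against a scratch file, so nothing on the system is modified:
#   blacklist_module nouveau /tmp/blacklist.conf
```

On the real system the target would be /etc/modprobe.d/blacklist.conf and the call needs root privileges. On Ubuntu, which the kernel name in the error message suggests, running sudo update-initramfs -u before rebooting is usually also needed so that nouveau is not loaded from the initial ramdisk. After the reboot, lsmod | grep nouveau should print nothing.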

If the NVIDIA driver is already installed, reinstall it after blacklisting nouveau.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow