CUDA multi gpu: running same kernel (dual chip device)

https://stackoverflow.com/questions/12353337

01-07-2021

Question

I'm writing code that must run the same kernel on both chips of a GTX 690 (dual-chip) board. Since the computation is easily separable, I don't need any data exchange between the devices; I just merge the results on the CPU. I understand how to run code on each device, and also how to allocate input/output buffers in each device's memory space.

I'm encountering problems when trying to set constants, which the kernel that runs on both devices uses. Do I need to make an array for each device? If so, how can I check, within the kernel, which device is running so I can access the proper constant from the array?

Here is some of the code I am using. It works properly when I use one chip only (by setting numDev = 1), but not on both.

__constant__ float d_cellSizeZ;
std::vector<int*> d_cell;
.................
bool Init(int cellsN_, float size_){
  bool res = true;
  if(cudaSuccess != cudaGetDeviceCount(&numDev))
    return false;
  //numDev = 1;
  d_cell.resize(numDev);
  for(int i = 0; i < numDev; ++i){
    res &= (cudaSuccess == cudaSetDevice(i));
    if(!res)
      break;
    res &= (cudaSuccess == cudaMalloc((void**)&d_cell[i], cellsN_*sizeof(int)));
  }
  // Runs once, after the loop, so only the last selected device is written
  res &= (cudaSuccess == cudaMemcpyToSymbol(d_cellSizeZ, &size_, sizeof(float)));
  if(!res)
    Cleanup();
  return res;
}

In the kernel I just use the d_cellSizeZ constant. So, how should I make a constant for each device to be used nicely from the same kernel?

Another question is: if I try exchanging data between devices, will it go through the PCI bus or does there exist some internal path on dual-chip boards?

Solution

You're only initialising d_cellSizeZ on the last device: cudaMemcpyToSymbol() runs once, after the loop, while the last device selected by cudaSetDevice() is still current, so on every other device the constant is never set. You need to initialise d_cellSizeZ on each device. The easiest way is to do it inside the loop, as Greg suggested in the comments:

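// checkCudaErrors() is the error-checking macro from helper_cuda.h in the CUDA samples
for(int i = 0; i < numDev; ++i)
{
    checkCudaErrors(cudaSetDevice(i));
    checkCudaErrors(cudaMalloc((void**)&d_cell[i], cellsN_*sizeof(int)));
    // Copies into the d_cellSizeZ instance that lives on device i
    checkCudaErrors(cudaMemcpyToSymbol(d_cellSizeZ, &size_, sizeof(float)));
}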

Reusing the single d_cellSizeZ symbol for every device can look odd. Behind the scenes each device holds its own copy of the constant, and cudaMemcpyToSymbol() looks the symbol up on the currently active device, so each call in the loop copies to the correct device. The kernel therefore needs neither an array of constants nor a way to find out which device it is running on; it just reads d_cellSizeZ.
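To see the per-device behaviour directly, the following is a minimal sketch rather than the asker's actual code. It assumes a hypothetical kernel readCellSize and a hypothetical helper setAndVerifyCellSize, neither of which appears in the question. The helper sets the constant on each device in turn and reads it back through the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Declared once in the source, but every device gets its own copy.
__constant__ float d_cellSizeZ;

// Hypothetical kernel: writes the constant to global memory so the
// host can inspect what this device actually sees.
__global__ void readCellSize(float* out)
{
    *out = d_cellSizeZ;
}

// Hypothetical helper: sets d_cellSizeZ on every device and checks it.
// Returns true only if every device reads back exactly size_.
bool setAndVerifyCellSize(float size_)
{
    int numDev = 0;
    if (cudaGetDeviceCount(&numDev) != cudaSuccess || numDev == 0)
        return false;

    for (int i = 0; i < numDev; ++i) {
        if (cudaSetDevice(i) != cudaSuccess)
            return false;

        // Targets the copy of the symbol on the current device (device i).
        if (cudaMemcpyToSymbol(d_cellSizeZ, &size_, sizeof(float)) != cudaSuccess)
            return false;

        float* d_out = nullptr;
        if (cudaMalloc((void**)&d_out, sizeof(float)) != cudaSuccess)
            return false;

        readCellSize<<<1, 1>>>(d_out);

        // A blocking cudaMemcpy waits for the kernel on the default stream.
        float h_out = 0.0f;
        cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(float),
                                     cudaMemcpyDeviceToHost);
        cudaFree(d_out);

        if (err != cudaSuccess || h_out != size_)
            return false;

        printf("device %d sees d_cellSizeZ = %f\n", i, h_out);
    }
    return true;
}

int main()
{
    return setAndVerifyCellSize(0.5f) ? 0 : 1;
}
```

Moving the cudaMemcpyToSymbol() call out of the loop in this sketch should reproduce the original symptom: only the last device selected would be guaranteed to read back the expected value.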

Licensed under: CC-BY-SA with attribution
