Question

Which device number, 0 or 1, must I set with cudaSetDevice() in order to copy P2P (GPU0 -> GPU1) using cudaStreamCreate(&stream); cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream); ?

Code:

// Set device 0 as current
cudaSetDevice(0); 
float* p0;
size_t size = 1024 * sizeof(float);
// Allocate memory on device 0
cudaMalloc(&p0, size); 
// Set device 1 as current
cudaSetDevice(1); 
float* p1;
// Allocate memory on device 1
cudaMalloc(&p1, size); 
// Set device 0 as current
cudaSetDevice(0);
// Launch kernel on device 0
MyKernel<<<1000, 128>>>(p0); 

// Which device must be current here: 0 or 1?
cudaSetDevice(1); // or cudaSetDevice(0)?
cudaStream_t stream;
cudaStreamCreate(&stream);

// Copy p0 to p1
cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream); 
cudaStreamSynchronize(stream);

// Launch kernel on device 1
cudaSetDevice(1); 
MyKernel<<<1000, 128>>>(p1);

UPDATE 31.03.2014: Or is the current context important only for launching a __global__ kernel_function(), not for cudaMemcpyPeerAsync()? And for cudaMemcpyAsync() and cudaMemcpyPeerAsync(), is it only important that the stream was created for the source device (the one the data is copied from)?


Solution

In the call to cudaMemcpyPeerAsync you can specify a non-default stream. So your first question is: which device should be current (set by cudaSetDevice) before the call to cudaMemcpyPeerAsync?

The answer is that you must set, with cudaSetDevice, the device for which the stream was created. You can use a stream created for either the source or the destination device. Although, to the best of my knowledge, this is not explicitly stated in the documentation, it can be inferred from Robert Crovella's answer to How to define destination device stream in cudaMemcpyPeerAsync?. Note that, as of 2011 and according to the Multi-GPU Programming presentation, performance is maximized when the stream belongs to the source GPU.
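Putting this together, the recommended pattern for the code in the question would be the following sketch: create the stream on the source device (device 0, where p0 lives), make that device current, then issue the peer copy. This is untested and reuses p0, p1, and size from the question's code:

// Make the source device current; the stream created next belongs to it.
cudaSetDevice(0);
cudaStream_t stream;
cudaStreamCreate(&stream);   // note: cudaStreamCreate takes a cudaStream_t*

// Asynchronously copy p0 (device 0) to p1 (device 1) on that stream.
cudaMemcpyPeerAsync(p1, 1, p0, 0, size, stream);
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);

A stream created after cudaSetDevice(1) would also be legal here, but per the Multi-GPU Programming guidance above, the source GPU's stream is the better choice for throughput.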

Let me recall some important points about using streams in a multi-GPU setting, borrowed from Multi-GPU Programming, which support the statements above:

  1. CUDA streams are per-device;
  2. a stream belongs to the GPU that was current at the time of its creation;
  3. calls into a stream can be issued only when its device is current.
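These three rules can be illustrated with a short sketch (untested; the stream names s0 and s1 are made up for illustration, and p0/p1 are the question's device pointers):

// Rule 2: each stream belongs to the device that was current at creation.
cudaSetDevice(0);
cudaStream_t s0;
cudaStreamCreate(&s0);   // s0 belongs to device 0

cudaSetDevice(1);
cudaStream_t s1;
cudaStreamCreate(&s1);   // s1 belongs to device 1

// Rule 3: to issue work into s0, device 0 must be current again.
cudaSetDevice(0);
MyKernel<<<1000, 128, 0, s0>>>(p0);    // OK: s0's device is current
// MyKernel<<<1000, 128, 0, s1>>>(p0); // invalid: s1 belongs to device 1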
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow