CUDA: why does a particular memory copy operation always cost 10 times more than other similar ones?

StackOverflow https://stackoverflow.com/questions/16331163

  •  14-04-2022

Question

I believe the following code carries out a typical

  • copy to device
  • call kernel
  • copy back to host

workflow.

  1. What I found very strange is that when I used the Trace Application option of the NSight Profiler, with 'stack trace' turned on, the report showed the most expensive operation to be the highlighted line below, and just that line, while the other memcpy operations cost only about 10% or less of that one.

    Is this because it is the first line after the kernel call, so the profiler somehow folded the cost of some synchronization into the cost of this particular memcpy operation?

  2. For a problem like the one I'm working on, which requires very frequent synchronization and 'returning' results to the host, can anyone offer some general advice on best practice? I was thinking in particular about two options, though I'm not sure whether they will eventually help:

    • use 'zero-copy' memory, (CUDA by Example 11.2)
    • create my own synchronization using atomic operations

{

int numP = p_psPtr->P.size();
int numL = p_psPtr->L.size();

// Our partition is in units of the number of particles
int block_dim = BLOCK_DIM_X;
int grid_dim = numP / block_dim + (numP % block_dim == 0 ? 0 : 1);

vector<Particle> pVec(p_psPtr->P.begin(), p_psPtr->P.end());
Particle *d_part_arr = 0;
Particle *part_arr = pVec.data(); 
HANDLE_ERROR(cudaMalloc((void**)&d_part_arr, numP * sizeof(Particle)));
HANDLE_ERROR(cudaMemcpy(d_part_arr, part_arr, numP * sizeof(Particle), cudaMemcpyHostToDevice));

vector<SpringLink> lVec(p_psPtr->L.begin(), p_psPtr->L.end());
SpringLink *d_link_arr = 0;
SpringLink *link_arr = lVec.data(); 
HANDLE_ERROR(cudaMalloc((void**)&d_link_arr, numL * sizeof(SpringLink)));
HANDLE_ERROR(cudaMemcpy(d_link_arr, link_arr, numL * sizeof(SpringLink), cudaMemcpyHostToDevice));

Point3D *d_oriPos_arr = 0;
Point3D *oriPos_arr = p_originalPos.data(); 
HANDLE_ERROR(cudaMalloc((void**)&d_oriPos_arr, numP * sizeof(Point3D)));
HANDLE_ERROR(cudaMemcpy(d_oriPos_arr, oriPos_arr, numP * sizeof(Point3D), cudaMemcpyHostToDevice));

Vector3D *d_oriVel_arr = 0;
Vector3D *oriVel_arr = p_originalVel.data(); 
HANDLE_ERROR(cudaMalloc((void**)&d_oriVel_arr, numP * sizeof(Vector3D)));
HANDLE_ERROR(cudaMemcpy(d_oriVel_arr, oriVel_arr, numP * sizeof(Vector3D), cudaMemcpyHostToDevice));

Point3D *d_updPos_arr = 0;
Point3D *updPos_arr = p_updatedPos.data(); 
HANDLE_ERROR(cudaMalloc((void**)&d_updPos_arr, numP * sizeof(Point3D)));
HANDLE_ERROR(cudaMemcpy(d_updPos_arr, updPos_arr, numP * sizeof(Point3D), cudaMemcpyHostToDevice));

Vector3D *d_updVel_arr = 0;
Vector3D *updVel_arr = p_updatedVel.data(); 
HANDLE_ERROR(cudaMalloc((void**)&d_updVel_arr, numP * sizeof(Vector3D)));
HANDLE_ERROR(cudaMemcpy(d_updVel_arr, updVel_arr, numP * sizeof(Vector3D), cudaMemcpyHostToDevice));

int *d_converged_arr = 0;
int *converged_arr = &p_converged[0]; 
HANDLE_ERROR(cudaMalloc((void**)&d_converged_arr, numP * sizeof(int)));
HANDLE_ERROR(cudaMemcpy(d_converged_arr, converged_arr, numP * sizeof(int), cudaMemcpyHostToDevice));

// Run the function on the device
handleParticleKernel<<<grid_dim, block_dim>>>(d_part_arr, d_link_arr, numP,
    d_oriPos_arr, d_oriVel_arr, d_updPos_arr, d_updVel_arr, 
    d_converged_arr, p_innerLoopIdx, p_dt);

HANDLE_ERROR(cudaMemcpy(oriPos_arr, d_oriPos_arr, numP * sizeof(Point3D), cudaMemcpyDeviceToHost)); // <-- the expensive copy highlighted in the profiler
HANDLE_ERROR(cudaMemcpy(oriVel_arr, d_oriVel_arr, numP * sizeof(Vector3D), cudaMemcpyDeviceToHost));
HANDLE_ERROR(cudaMemcpy(updPos_arr, d_updPos_arr, numP * sizeof(Point3D), cudaMemcpyDeviceToHost));
HANDLE_ERROR(cudaMemcpy(updVel_arr, d_updVel_arr, numP * sizeof(Vector3D), cudaMemcpyDeviceToHost));
HANDLE_ERROR(cudaMemcpy(converged_arr, d_converged_arr, numP * sizeof(int), cudaMemcpyDeviceToHost));

}


Solution

That particular cudaMemcpy call takes longer because it has to wait until your kernel completes. If you add a cudaDeviceSynchronize after the kernel, the perceived execution time of that cudaMemcpy call should fall in line with all of the others. (Of course, the additional time you're seeing will instead be spent in your cudaDeviceSynchronize call.)
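A minimal sketch of how the timing attribution shifts, using the buffer names from the question's code:

```cuda
// Kernel launches are asynchronous: control returns to the host immediately,
// while the kernel keeps running on the device.
handleParticleKernel<<<grid_dim, block_dim>>>(d_part_arr, d_link_arr, numP,
    d_oriPos_arr, d_oriVel_arr, d_updPos_arr, d_updVel_arr,
    d_converged_arr, p_innerLoopIdx, p_dt);

// Without this, the first cudaMemcpy below silently absorbs the kernel's
// runtime, which is what the profiler attributed to that one copy.
HANDLE_ERROR(cudaDeviceSynchronize());

// Now each device-to-host copy is timed as just a copy.
HANDLE_ERROR(cudaMemcpy(oriPos_arr, d_oriPos_arr,
                        numP * sizeof(Point3D), cudaMemcpyDeviceToHost));
```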

However, the time you spend in cudaDeviceSynchronize is somewhat of a fundamental cost that you can't really get around: if you need an output from your kernel, you have to wait until the kernel has finished executing. Since kernel launches are asynchronous, you can execute unrelated statements while the kernel is running; in your case, however, the very next call copies one of the kernel's outputs to host memory, so you have to wait for the kernel to finish before you can get the data.

If your program permits, you can try breaking up your kernel launch and memory transfers into chunks and launching them on different streams, although the viability of this depends on a couple of factors (e.g. your kernel might not decompose well into independent parts). If you do go this route, the best-case scenario would look like this (taken from the CUDA Best Practices Guide):

[Figure: timeline from the CUDA Best Practices Guide showing kernel execution on one stream overlapped with memory transfers on other streams]
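A rough sketch of that chunked, multi-stream pattern. It assumes the particle ranges are independent; `NUM_STREAMS`, `h_part`, and `processChunkKernel` are illustrative names, not from the original code, and async copies require page-locked (pinned) host memory:

```cuda
const int NUM_STREAMS = 4;
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    HANDLE_ERROR(cudaStreamCreate(&streams[i]));

// cudaMemcpyAsync only overlaps with kernels when the host buffer is pinned.
Particle *h_part;
HANDLE_ERROR(cudaHostAlloc((void**)&h_part, numP * sizeof(Particle),
                           cudaHostAllocDefault));

int chunk = (numP + NUM_STREAMS - 1) / NUM_STREAMS;
for (int i = 0; i < NUM_STREAMS; ++i) {
    int offset = i * chunk;
    int n = (numP - offset < chunk) ? (numP - offset) : chunk;
    if (n <= 0) break;
    // Copy-in, kernel, and copy-out for each chunk are queued on one stream;
    // operations queued on different streams can overlap on the device.
    HANDLE_ERROR(cudaMemcpyAsync(d_part_arr + offset, h_part + offset,
                                 n * sizeof(Particle),
                                 cudaMemcpyHostToDevice, streams[i]));
    processChunkKernel<<<(n + block_dim - 1) / block_dim, block_dim,
                         0, streams[i]>>>(d_part_arr + offset, n);
    HANDLE_ERROR(cudaMemcpyAsync(h_part + offset, d_part_arr + offset,
                                 n * sizeof(Particle),
                                 cudaMemcpyDeviceToHost, streams[i]));
}
HANDLE_ERROR(cudaDeviceSynchronize());
for (int i = 0; i < NUM_STREAMS; ++i)
    HANDLE_ERROR(cudaStreamDestroy(streams[i]));
```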

This would allow you to overlap data transfers with kernel execution, which hides some of the data transfer cost. You can achieve similar asynchrony with zero-copy memory; just be forewarned that such transfers aren't cached, so depending on your kernel's access patterns you can end up with lower throughput.
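A sketch of the zero-copy alternative, applied here to the convergence flags as an illustration (the `h_converged`/`d_converged` names are placeholders): the kernel reads and writes host memory directly over the bus, so no cudaMemcpy is needed at all.

```cuda
// Enable mapped pinned memory; must be called before other CUDA work.
HANDLE_ERROR(cudaSetDeviceFlags(cudaDeviceMapHost));

// Allocate host memory that the device can address directly.
int *h_converged;
HANDLE_ERROR(cudaHostAlloc((void**)&h_converged, numP * sizeof(int),
                           cudaHostAllocMapped));

// Get the device-side alias of the same buffer.
int *d_converged;
HANDLE_ERROR(cudaHostGetDevicePointer((void**)&d_converged, h_converged, 0));

// The kernel writes through d_converged straight into host memory. Every
// access crosses the bus uncached, so this suits small, infrequently touched
// data such as per-particle flags, not bulk arrays.
// ... launch kernel with d_converged in place of a copied device buffer ...
HANDLE_ERROR(cudaDeviceSynchronize());  // ensure the writes are visible on host
```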

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow