Question

The following is my code, which tries to achieve a non-blocking CUDA memory copy from host to device.

for (i = 0; i < ldu; ++i)
{
    cudaMemcpyAsync(dA + i*num_row, &A + i*LDA,
                    num_row*sizeof(double), cudaMemcpyHostToDevice, streams[0]);
}

The average time for each such call is around 10 microseconds. I have also tried the blocking version, which takes 30 microseconds. 10 microseconds seems like a lot for a non-blocking call. A is allocated using cudaHostAlloc. I run my code on a machine equipped with a single Tesla C2050, and I use CUDA version 5.5 to compile the code. I have read that GPU PCI transfer latency (somewhat irrelevant to a non-blocking call, but it gives an idea of the order of magnitude) is around 5 us. So a return time of 10 us for a non-blocking call seems to be on the high side. Anything I can do to speed it up?

A couple of things I tried were adding an OpenMP pragma (which resulted in a slowdown) and sending the data using different streams (which gave about the same average time).


Solution

Anything I can do to speed it up?

I believe you can replace your copy loop with:

cudaMemcpy2DAsync(dA, num_row*sizeof(double), &A, LDA*sizeof(double), num_row*sizeof(double), ldu, cudaMemcpyHostToDevice, streams[0]);

which should speed things up (at least from a call overhead standpoint) significantly.

You may have to play with your parameters a bit, as your names are somewhat confusing to me (maybe you are using column-major storage). The cudaMemcpy2DAsync function is documented in the CUDA Runtime API reference.
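To make the parameter mapping concrete, here is a minimal sketch of the single 2D copy wrapped in a helper. It assumes that the host pointer refers to the first element (what the question passes as &A), that the segments are num_row doubles long and LDA doubles apart on the host, and that dA is densely packed on the device. The helper name copy_segments_async and its signature are hypothetical, not part of the original code.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not from the original post): copies `count` segments
// of `num_row` doubles each in one call. Segment i starts at A + i*LDA on the
// host and at dA + i*num_row on the device, mirroring the loop in the
// question under the assumption that A points to the first host element.
cudaError_t copy_segments_async(double* dA, const double* A,
                                size_t num_row, size_t LDA, size_t count,
                                cudaStream_t stream)
{
    return cudaMemcpy2DAsync(
        dA, num_row * sizeof(double),  // destination and its pitch in bytes (packed)
        A,  LDA * sizeof(double),      // source and its pitch in bytes (strided)
        num_row * sizeof(double),      // width: bytes copied per segment
        count,                         // height: number of segments
        cudaMemcpyHostToDevice, stream);
}
```

With the question's variables, the call would look something like copy_segments_async(dA, &A, num_row, LDA, ldu, streams[0]). The width may not exceed either pitch, so this only works when LDA >= num_row, and the copy is only asynchronous with respect to the host when the source memory is pinned.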

OTHER TIPS

Fermi GPUs have only one copy engine for each direction, so all copy commands in the same direction are serialized, whether they are async or not.
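If you want to check how many copy engines your own device reports, a minimal sketch is to read the asyncEngineCount field of cudaDeviceProp. The helper name copy_engine_count is hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of asynchronous copy engines
// reported for `device`, or -1 if the property query fails.
int copy_engine_count(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.asyncEngineCount;
}
```

According to the runtime documentation, a value of 2 means the device can copy in both directions at once while running a kernel, 1 means one direction at a time, and 0 means copies cannot overlap with kernel execution.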

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow