Question

The following code tries to perform a non-blocking CUDA memory copy from host to device.

for (i = 0; i < ldu; ++i)
{
    cudaMemcpyAsync(dA + i*num_row, &A + i*LDA,
                    num_row*sizeof(double), cudaMemcpyHostToDevice, streams[0]);
}

Each such call takes around 10 microseconds on average. The blocking version takes around 30 microseconds. 10 microseconds seems like a lot for a non-blocking call. A is allocated using cudaHostAlloc. I run my code on a machine with a single Tesla C2050, and I compile with CUDA 5.5. I have read that GPU PCI transfer latency is around 5 us (not directly relevant to a non-blocking call, but it gives an idea of the order of magnitude), so a return time of 10 us for a non-blocking call seems high. Anything I can do to speed it up?

A couple of things I tried: adding an OpenMP pragma (which resulted in a slowdown) and sending the data using different streams (which gave around the same average time).


Solution

Anything I can do to speed it up?

I believe you can replace your copy loop with:

cudaMemcpy2DAsync(dA, num_row*sizeof(double), &A, LDA*sizeof(double), num_row*sizeof(double), ldu, cudaMemcpyHostToDevice, streams[0]);

This replaces ldu separate calls with a single one, which should speed things up significantly, at least from a call-overhead standpoint.

You may have to adjust the parameters a bit, as your names are somewhat confusing to me (maybe you are using column-major storage). The cudaMemcpy2DAsync function is documented in the CUDA Runtime API reference.
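To make the parameter mapping concrete, here is a minimal sketch of the single-call version. The helper name copy_columns_to_device is hypothetical, and the sketch assumes that A is a double* pointing at the first element of a pinned, column-major host array with leading dimension LDA, that dA is a densely packed device array with leading dimension num_row, and that num_row <= LDA (the copy width must not exceed either pitch).

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: copies ldu columns of num_row doubles each from a
// host array with leading dimension LDA into a packed device array.
// Assumes A points at the first element and num_row <= LDA.
cudaError_t copy_columns_to_device(double* dA, const double* A,
                                   size_t num_row, size_t LDA, size_t ldu,
                                   cudaStream_t stream)
{
    return cudaMemcpy2DAsync(
        dA,                        // destination (device)
        num_row * sizeof(double),  // dpitch: bytes between destination columns
        A,                         // source (host, ideally pinned)
        LDA * sizeof(double),      // spitch: bytes between source columns
        num_row * sizeof(double),  // width: bytes copied per column
        ldu,                       // height: number of columns
        cudaMemcpyHostToDevice,
        stream);
}
```

With the names from the question, the call would look like copy_columns_to_device(dA, A, num_row, LDA, ldu, streams[0]). As with the original loop, the host buffer should not be modified until the stream has been synchronized, for example with cudaStreamSynchronize(streams[0]).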

Other tips

Fermi GPUs have only one copy engine for each direction. So all copy commands in the same direction are serialized, whether or not they are asynchronous.
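If you want to check how many copy engines your own device reports, a small sketch like the following may help. The helper name copy_engine_count is hypothetical; it reads the asyncEngineCount field of cudaDeviceProp, where a value of 2 indicates the device can copy in both directions concurrently while a kernel runs, and 1 indicates a single concurrent copy.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of asynchronous copy engines
// reported for the given device, or -1 if the query fails.
int copy_engine_count(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;  // e.g. invalid device index or no CUDA device present
    }
    return prop.asyncEngineCount;
}
```

Calling copy_engine_count(0) queries the first device. Whatever the count, several host-to-device copies still share the engine for that direction, so spreading them over more streams would not be expected to make them run in parallel.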

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow