Question

The following code is my attempt at a non-blocking CUDA memory copy from host to device:

for (i = 0; i < ldu; ++i)
{
    cudaMemcpyAsync(dA + i*num_row, &A + i*LDA,
                    num_row*sizeof(double), cudaMemcpyHostToDevice, streams[0]);
}

Each call takes around 10 microseconds on average to return. The blocking version takes about 30 microseconds per call. A is allocated using cudaHostAlloc. I run the code on a machine with a single Tesla C2050 and compile it with CUDA 5.5. I have read that GPU PCI transfer latency is around 5 us (not directly relevant to a non-blocking call, but it gives an idea of the time scale), so a 10 us return time for a non-blocking call seems high. Is there anything I can do to speed it up?

A couple of things I tried: adding an OpenMP pragma (which slowed things down) and sending the data on different streams (which gave about the same average time).

Solution

"Anything I can do to speed it up?"

I believe you can replace your copy loop with:

cudaMemcpy2DAsync(dA, num_row*sizeof(double), &A, LDA*sizeof(double), num_row*sizeof(double), ldu, cudaMemcpyHostToDevice, streams[0]);

which should significantly speed things up, at least in terms of call overhead, since it issues one call instead of ldu separate calls.

You may have to adjust the parameters a bit, as your names are somewhat confusing to me (maybe you are using column-major storage). The cudaMemcpy2DAsync function is documented in the CUDA Runtime API reference.
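To make the parameter mapping concrete, here is a host-only sketch in plain C. It is not CUDA code: memcpy stands in for the device copy, and copy_2d, copy_by_rows and strided_copy_matches_loop are hypothetical helper names. It assumes A is a plain pointer to the first element, and it takes the pitches and the width in bytes, as cudaMemcpy2DAsync does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Host-only stand-in for the semantics of a 2D copy: copy `height` rows of
   `width` bytes each. Consecutive rows start `spitch` bytes apart in the
   source and `dpitch` bytes apart in the destination. */
void copy_2d(void *dst, size_t dpitch, const void *src, size_t spitch,
             size_t width, size_t height)
{
    assert(width <= dpitch && width <= spitch);
    for (size_t r = 0; r < height; ++r)
        memcpy((char *)dst + r * dpitch, (const char *)src + r * spitch, width);
}

/* The loop from the question, with memcpy standing in for cudaMemcpyAsync. */
void copy_by_rows(double *dst, const double *src,
                  size_t num_row, size_t lda, size_t ldu)
{
    for (size_t i = 0; i < ldu; ++i)
        memcpy(dst + i * num_row, src + i * lda, num_row * sizeof(double));
}

/* Returns 1 if the single strided copy produces the same packed result as
   the per-row loop, 0 otherwise. The sizes are arbitrary illustration values. */
int strided_copy_matches_loop(void)
{
    enum { NUM_ROW = 3, LDA = 5, LDU = 4 };
    double A[LDA * LDU];
    double by_loop[NUM_ROW * LDU];
    double by_2d[NUM_ROW * LDU];

    for (int k = 0; k < LDA * LDU; ++k)
        A[k] = (double)k;

    copy_by_rows(by_loop, A, NUM_ROW, LDA, LDU);

    /* Same argument order as the suggested call, assuming A is a pointer:
       cudaMemcpy2DAsync(dA, num_row*sizeof(double), A, LDA*sizeof(double),
                         num_row*sizeof(double), ldu,
                         cudaMemcpyHostToDevice, streams[0]); */
    copy_2d(by_2d, NUM_ROW * sizeof(double), A, LDA * sizeof(double),
            NUM_ROW * sizeof(double), LDU);

    return memcmp(by_loop, by_2d, sizeof by_loop) == 0;
}
```

The destination pitch equals the width here because the loop in the question packs the rows back to back in dA. If dA were allocated with padding between rows, the destination pitch would be that padded row size in bytes instead.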

Other tips

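Fermi GPUs have only one copy engine for each direction, so all copy commands in the same direction are serialized, whether they are asynchronous or not. That is consistent with the multi-stream attempt in the question giving about the same average time.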

License: CC-BY-SA with attribution
Not affiliated with StackOverflow