Question

I'm currently learning CUDA and am now focusing on the throughput of memory copies between host and device. Here is a small program (see the notes below):

int NX=1000;
int NY=800;

int size=NX*NY;
size*=sizeof(PREC);
int threadsperbloc=512;
int blockspergrid=(NX*NY+threadsperbloc-1)/threadsperbloc; //round up: ceil() on an integer division has no effect

//Allocate and initialize host arrays
PREC *h_a;
PREC *h_b;
h_a=new PREC[NX*NY];
h_b=new PREC[NX*NY];
for (int i=0;i<NX*NY;i++){
    h_a[i]=i;
    h_b[i]=i;
}
//Allocate device arrays and a page-locked host array to fetch results
PREC *d_a=NULL;
PREC *d_b=NULL;
PREC *d_c=NULL;
PREC *dh_c=NULL;
CUDA_CHECK(cudaMalloc(&d_a,size));
CUDA_CHECK(cudaMalloc(&d_b,size));
CUDA_CHECK(cudaMalloc(&d_c,size));
CUDA_CHECK(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaHostAlloc(&dh_c,size,cudaHostAllocDefault));

//A simple vector addition on the device
vecadd<<<blockspergrid,threadsperbloc>>>(d_a, d_b, d_c, NX*NY);

//Repeating copies from device to page-locked host memory
for(int t=0;t<30;t++){
    CUDA_CHECK(cudaMemcpy(dh_c,d_c,size,cudaMemcpyDeviceToHost));
}
cout<<"Check : "<<h_a[1000]<<" + "<< h_b[1000]<<" = "<<dh_c[1000]<<endl;

Notes: PREC is a macro (float in this case). Only one stream is used (the default one). I don't use the Async variants here, since that is not the point (I tried them, but the throughput doesn't change).
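
On the block count: applying ceil to an integer division such as NX*NY/threadsperbloc has no effect, because the quotient is already truncated before ceil sees it. Here is a minimal sketch of the difference from a round-up integer division; the function names are hypothetical and not part of the program above.

```cpp
#include <cmath>

// Pattern ceil(n / threads): the integer division truncates first,
// so ceil() receives a whole number and changes nothing.
int blocks_truncated(int n, int threads_per_block) {
    return static_cast<int>(std::ceil(n / threads_per_block));
}

// Round-up integer division: enough blocks to cover every element.
int blocks_rounded_up(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With NX*NY = 800000 and 512 threads per block, the truncated form gives 1562 blocks (799744 threads), which leaves the last 256 elements uncovered. The rounded-up form gives 1563, and the kernel then needs a bounds check against the element count.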

With this small test, which I believe is correct (it produces the correct numerical result), the Visual Profiler reports a throughput of only 1.52 GB/s (flagged with an "attention" icon); each transfer is about 3 MB. However, the NVIDIA bandwidthTest from the SDK reports:

Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432         3177.5

I have a compute capability 3.0 device and would like to reach the same roughly 3 GB/s of throughput. I looked through bandwidthTest.cu but cannot see what I am doing differently (it uses cudaMemcpyAsync, but as I said, I tried that too with no different result). EDIT: You may have noticed that the SDK test transfers about 33 MB, and in fact does so 10 times. I've tried 30 × 3 MB and 10 × 12 MB, but nothing changed. So, what am I doing wrong?


Solution

The problem was actually caused by other processes running on the machine: the GPU I was testing on also drives the display. I tried a non-display GPU and it works fine.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow