The problem was actually caused by other processes running on the computer. I tried on a non-display GPU and it works fine.
CUDA | Cannot get high throughput using hostalloc
30-05-2022
Question
I'm currently learning CUDA, and I'm now focusing on the throughput of memory copies between host and device. Here is a small program (see notes below):
int NX=1000;
int NY=800;
int size=NX*NY;
size*=sizeof(PREC);
int threadsperbloc=512;
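//Round up: NX*NY/threadsperbloc is an integer division, so ceil() would come too late and the last partial block would be dropped
int blockspergrid=(NX*NY+threadsperbloc-1)/threadsperbloc;
//Allocate and initialize host arrays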
PREC *h_a;
PREC *h_b;
h_a=new PREC[NX*NY];
h_b=new PREC[NX*NY];
for (int i=0;i<NX*NY;i++){
    h_a[i]=i;
    h_b[i]=i;
}
//Allocate device arrays and a page-locked host array to fetch results
PREC *d_a=NULL;
PREC *d_b=NULL;
PREC *d_c=NULL;
PREC *dh_c=NULL;
CUDA_CHECK(cudaMalloc(&d_a,size));
CUDA_CHECK(cudaMalloc(&d_b,size));
CUDA_CHECK(cudaMalloc(&d_c,size));
CUDA_CHECK(cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice));
CUDA_CHECK(cudaHostAlloc(&dh_c,size,cudaHostAllocDefault));
//A simple vector addition on the device
vecadd<<<blockspergrid,threadsperbloc>>>(d_a, d_b, d_c, NX*NY);
//Repeating copies from device to page-locked host memory
for(int t=0;t<30;t++){
    CUDA_CHECK(cudaMemcpy(dh_c,d_c,size,cudaMemcpyDeviceToHost));
}
cout<<"Check : "<<h_a[1000]<<" + "<< h_b[1000]<<" = "<<dh_c[1000]<<endl;
Notes: PREC is a macro (float in this case). Only one stream is used (the default one). I don't use the Async variants here, since that is not the point (I've tried them, but the throughput doesn't change).
With this little test, which I think is correct (and which gives a correct numerical result), the Visual Profiler tells me that I only get a throughput of 1.52 GB/s (with an "attention" icon), each transfer being around 3 MB (just for info). However, the NVIDIA bandwidthTest from the SDK reports:
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     3177.5
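To cross-check the profiler's figure against the bandwidthTest number, the copy loop can also be timed directly with CUDA events. The following is only a minimal sketch under a few assumptions: the `CUDA_CHECK` macro and the `gigabytes_per_second` helper are written here for illustration (the original `CUDA_CHECK` is not shown in the post), the buffer is the same 1000 x 800 floats, and 1 GB is taken as 1e9 bytes, which may differ from the unit convention other tools use.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative stand-in for the post's CUDA_CHECK (assumption): abort on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical helper: bytes moved in `ms` milliseconds, as GB/s (1 GB = 1e9 bytes).
static double gigabytes_per_second(size_t bytes, float ms) {
    return (static_cast<double>(bytes) / 1e9) / (static_cast<double>(ms) / 1e3);
}

int main() {
    const size_t bytes = 1000 * 800 * sizeof(float);  // same buffer size as in the question
    const int reps = 30;

    float *d_c = nullptr;
    float *dh_c = nullptr;
    CUDA_CHECK(cudaMalloc(&d_c, bytes));
    CUDA_CHECK(cudaMemset(d_c, 0, bytes));
    CUDA_CHECK(cudaHostAlloc(&dh_c, bytes, cudaHostAllocDefault));  // page-locked host buffer

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // One untimed copy so first-use overhead does not skew the measurement.
    CUDA_CHECK(cudaMemcpy(dh_c, d_c, bytes, cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaEventRecord(start));
    for (int t = 0; t < reps; ++t) {
        CUDA_CHECK(cudaMemcpy(dh_c, d_c, bytes, cudaMemcpyDeviceToHost));
    }
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    std::printf("Device to host, pinned: %.2f GB/s\n",
                gigabytes_per_second(bytes * reps, ms));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFreeHost(dh_c));
    CUDA_CHECK(cudaFree(d_c));
    return 0;
}
```

If this number agrees with the profiler rather than with bandwidthTest, the difference is probably not in the copy code itself but in the conditions under which it runs.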
I have a compute capability 3.0 device and would also like to reach a throughput of a little over 3 GB/s. I've looked through bandwidthTest.cu, but I cannot see what I am doing differently (they use cudaMemcpyAsync, but as I said, I also tried that with no different result).
EDIT: You may have noticed that the test from the SDK transfers about 33 MB, and in fact does so 10 times. I've tried 30 x 3 MB and 10 x 12 MB, but nothing changed. So, what am I doing wrong?
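Since the note at the top of this page attributes the slowdown to other activity on the display GPU, it can help to see which devices are present and which one is likely driving a display. The sketch below is an assumption-laden illustration, not a definitive test: it uses the kernel run-time limit attribute (`cudaDevAttrKernelExecTimeout`) as a hint, because that limit is commonly enabled on GPUs attached to a display, and `timeout_label` is a hypothetical helper introduced here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turn the run-time-limit attribute into a readable label.
static const char *timeout_label(int enabled) {
    return enabled ? "yes (often a display GPU)" : "no";
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "No usable CUDA device found\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        int timeout = 0;
        // Query the device name and whether kernels on it are subject to a run-time limit.
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess ||
            cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, dev) != cudaSuccess) {
            std::fprintf(stderr, "Could not query device %d\n", dev);
            continue;
        }
        std::printf("Device %d: %s (compute capability %d.%d), kernel run-time limit: %s\n",
                    dev, prop.name, prop.major, prop.minor, timeout_label(timeout));
    }
    return 0;
}
```

If a second GPU without the run-time limit is listed, the test program can be pointed at it by calling `cudaSetDevice` with that index before any allocation, or by setting the `CUDA_VISIBLE_DEVICES` environment variable.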
Solution