Question

I have a CUDA kernel that processes a lot of data. Since I cannot transfer all the data at once, I have to split it into chunks, process them chunk by chunk, and update the output on the GPU. I am parsing the input data from a file. I was wondering if I could overlap the chunks' memory transfers by having two buffers, both on the host and on the GPU: while processing one chunk, I could read the next, transfer it to the GPU, and launch the kernel to the same stream. My problem is that the kernel's execution takes longer than parsing the data and transferring it to the GPU. How can I ensure that the memcpys won't overwrite the data the kernel is using, given that the memcpys are non-blocking?

// Pseudocode
// for every chunk:
//     parse data into the host buffer
//     cudaMemcpyAsync(dev, host, size, H2D)
//     launch kernel
//     switch buffers
// copy result from device to host

Thank you in advance.

Solution

Just insert an explicit sync point with cudaDeviceSynchronize() after the kernel launch.

That way, you are essentially starting a memory transfer and launching a kernel at the same time. The transfer would go to one buffer and the kernel would work on the other. The cudaDeviceSynchronize() would wait until both were done, at which time you would swap the buffers and repeat.

Of course, you also need to copy the results from the device to the host within the loop, and you need to handle the first iteration, when there is no data for the kernel to process yet, and the last iteration, when there is no more data to copy but one buffer still has to be processed. This can be done with conditionals inside the loop or by partially unrolling the loop so that the first and last iterations are coded explicitly.
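
Here is a minimal sketch of that scheme. parse_chunk(), the process kernel, CHUNK_SIZE, and the launch configuration are placeholders for your own code, not real API names. One detail the sketch makes explicit: for the copy and the kernel to actually run concurrently, the host buffers must be pinned (cudaMallocHost) and the two operations must be issued to different streams, so it uses one stream per buffer.

#include <cuda_runtime.h>

#define CHUNK_SIZE (1 << 20)                                     // placeholder chunk size

__global__ void process(const float *in, float *out, size_t n); // your kernel
bool parse_chunk(float *host_buf, size_t *n);                   // your parser; false at end of file

void run(float *d_out)
{
    float *h_buf[2], *d_buf[2];
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i) {
        cudaMallocHost(&h_buf[i], CHUNK_SIZE * sizeof(float));  // pinned: needed for true async copies
        cudaMalloc(&d_buf[i], CHUNK_SIZE * sizeof(float));
        cudaStreamCreate(&stream[i]);
    }

    int cur = 0;
    size_t n, prev_n = 0;
    bool pending = false;                     // is a chunk already waiting on the GPU?

    while (parse_chunk(h_buf[cur], &n)) {
        // Start uploading the chunk just parsed into one buffer...
        cudaMemcpyAsync(d_buf[cur], h_buf[cur], n * sizeof(float),
                        cudaMemcpyHostToDevice, stream[cur]);
        // ...while the kernel consumes the previous chunk from the other one.
        if (pending)
            process<<<256, 256, 0, stream[cur ^ 1]>>>(d_buf[cur ^ 1], d_out, prev_n);

        cudaDeviceSynchronize();              // wait for both; buffers are now safe to reuse
        prev_n = n;
        pending = true;
        cur ^= 1;                             // swap buffers
    }

    if (pending) {                            // last iteration: one uploaded chunk left to process
        process<<<256, 256, 0, stream[cur ^ 1]>>>(d_buf[cur ^ 1], d_out, prev_n);
        cudaDeviceSynchronize();
    }
}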

Edit:

By moving the sync point to just before the cudaMemcpyAsync() and after the file read and parse, you allow the kernel to also overlap that part of the processing (if the kernel runs long enough).
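
Applied to the sketch above (same hypothetical names), only the loop changes: the parse now runs while the previous copy and kernel are still in flight, and the sync happens right before the GPU buffers are reused.

while (parse_chunk(h_buf[cur], &n)) {         // parse overlaps the in-flight copy and kernel
    cudaDeviceSynchronize();                  // wait for them only once the next chunk is ready
    cudaMemcpyAsync(d_buf[cur], h_buf[cur], n * sizeof(float),
                    cudaMemcpyHostToDevice, stream[cur]);
    if (pending)
        process<<<256, 256, 0, stream[cur ^ 1]>>>(d_buf[cur ^ 1], d_out, prev_n);
    prev_n = n;
    pending = true;
    cur ^= 1;
}
cudaDeviceSynchronize();                      // drain, then process d_buf[cur ^ 1] as in the tail above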

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow