Just insert an explicit sync point with cudaDeviceSynchronize() after the kernel launch.
That way, you start a memory transfer and launch a kernel at essentially the same time. The transfer goes to one buffer while the kernel works on the other. The cudaDeviceSynchronize() call waits until both are done, at which point you swap the buffers and repeat.
Of course, you also need to copy the results from the device to the host within the loop. You also need logic for the first iteration, when there's no data for the kernel to process yet, and for the last iteration, when there's no more data to copy but still one buffer to process. You can handle both cases with conditionals inside the loop or by partially unrolling the loop and coding the first and last iterations explicitly.
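The loop described above could look roughly like the sketch below. This is only an illustration under assumptions, not the asker's actual code: the process kernel, the run_double_buffered function, the stream names, and the generated input (standing in for the file read and parse) are all hypothetical. It also assumes n > 0, pinned host memory, and two non-default streams, which the copy and the kernel need in order to overlap.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel standing in for the real work: doubles each element.
__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Processes `chunks` chunks of `n` floats and returns all results in order.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_double_buffered(int chunks, int n) {
    size_t bytes = n * sizeof(float);
    float *h_in, *h_out, *d_in[2], *d_out;
    cudaMallocHost(&h_in, bytes);   // pinned, so the async copy can overlap the kernel
    cudaMallocHost(&h_out, bytes);
    cudaMalloc(&d_in[0], bytes);
    cudaMalloc(&d_in[1], bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t copy, compute;     // separate streams for transfer and kernel
    cudaStreamCreate(&copy);
    cudaStreamCreate(&compute);

    std::vector<float> result;
    // chunks + 1 iterations: iteration i uploads chunk i and processes chunk i - 1.
    for (int i = 0; i <= chunks; ++i) {
        int up = i % 2;      // device buffer receiving chunk i
        int work = 1 - up;   // device buffer holding chunk i - 1
        if (i < chunks) {    // last iteration: nothing left to upload
            for (int k = 0; k < n; ++k) h_in[k] = float(i * n + k);  // stand-in for read and parse
            cudaMemcpyAsync(d_in[up], h_in, bytes, cudaMemcpyHostToDevice, copy);
        }
        if (i > 0) {         // first iteration: nothing to process yet
            process<<<(n + 255) / 256, 256, 0, compute>>>(d_in[work], d_out, n);
            cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, compute);
        }
        cudaDeviceSynchronize();  // wait for the upload, the kernel, and the download
        if (i > 0) result.insert(result.end(), h_out, h_out + n);
    }

    cudaStreamDestroy(copy);
    cudaStreamDestroy(compute);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    cudaFree(d_in[0]);
    cudaFree(d_in[1]);
    cudaFree(d_out);
    return result;
}
```

Here the first and last iterations are handled with conditionals inside the loop rather than by unrolling. Because i % 2 alternates, the buffer swap is implicit. With this stand-in input, run_double_buffered(3, 1000) should return 3000 values where element j equals 2 * j.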
Edit:
By moving the sync point to just before the cudaMemcpyAsync() and after the file read and parse, you allow the kernel to also overlap that part of the processing (if the kernel runs long enough).
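A sketch of that reordering follows, again under assumptions: the process kernel, the run_overlapped_host_work function, and the generated input are hypothetical stand-ins, and n > 0 is assumed. One consequence worth noting is that the host now fills a buffer while the previous upload may still be in flight, so this version alternates between two pinned host input buffers to avoid overwriting data that is still being copied.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel standing in for the real work: doubles each element.
__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Same pipeline, but the sync sits after the host-side read and parse and
// just before the next cudaMemcpyAsync(). Error checking is omitted.
std::vector<float> run_overlapped_host_work(int chunks, int n) {
    size_t bytes = n * sizeof(float);
    float *h_in[2], *h_out, *d_in[2], *d_out;
    for (int b = 0; b < 2; ++b) {
        cudaMallocHost(&h_in[b], bytes);  // two pinned host buffers, used alternately
        cudaMalloc(&d_in[b], bytes);
    }
    cudaMallocHost(&h_out, bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t copy, compute;
    cudaStreamCreate(&copy);
    cudaStreamCreate(&compute);

    std::vector<float> result;
    for (int i = 0; i <= chunks; ++i) {
        int up = i % 2, work = 1 - up;
        if (i < chunks) {
            // Stand-in for read and parse; runs while the kernel launched in
            // the previous iteration may still be executing.
            for (int k = 0; k < n; ++k) h_in[up][k] = float(i * n + k);
        }
        cudaDeviceSynchronize();  // previous upload, kernel, and download are done
        if (i > 1) result.insert(result.end(), h_out, h_out + n);  // results of chunk i - 2
        if (i < chunks)
            cudaMemcpyAsync(d_in[up], h_in[up], bytes, cudaMemcpyHostToDevice, copy);
        if (i > 0) {
            process<<<(n + 255) / 256, 256, 0, compute>>>(d_in[work], d_out, n);
            cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, compute);
        }
    }
    cudaDeviceSynchronize();  // drain the last kernel and download
    if (chunks > 0) result.insert(result.end(), h_out, h_out + n);

    cudaStreamDestroy(copy);
    cudaStreamDestroy(compute);
    for (int b = 0; b < 2; ++b) {
        cudaFreeHost(h_in[b]);
        cudaFree(d_in[b]);
    }
    cudaFreeHost(h_out);
    cudaFree(d_out);
    return result;
}
```

The results of each kernel are collected one iteration later, right after the sync, so an extra cudaDeviceSynchronize() after the loop is needed to pick up the final chunk. Whether the reordering pays off depends on the kernel running long enough to cover the read and parse.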