The NVIDIA OpenCL SDK includes an example, oclCopyComputeOverlap, that uses two command queues to alternately transfer buffers and execute kernels.
The example uses mapped (pinned) host memory:
// pinned memory
cmPinnedSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, szBuffBytes, NULL, &ciErrNum);
// host pointer for pinned memory
fSourceA = (cl_float*)clEnqueueMapBuffer(cqCommandQueue[0], cmPinnedSrcA, CL_TRUE, CL_MAP_WRITE, 0, szBuffBytes, 0, NULL, NULL, &ciErrNum);
...
// normal device buffer
cmDevSrcA = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, szBuffBytes, NULL, &ciErrNum);
// write half the data from host pointer to device buffer
ciErrNum = clEnqueueWriteBuffer(cqCommandQueue[0], cmDevSrcA, CL_FALSE, 0, szHalfBuffer, (void*)&fSourceA[0], 0, NULL, NULL);
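To make the half-buffer write concrete, here is how I understand the split, sketched in plain C without any OpenCL calls. This is only my reading: `HalfSlice` and `half_slice` are hypothetical names that do not appear in the SDK sample, I am assuming the buffer holds an even number of elements, and I am assuming the second half is handled the same way on the second queue.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous half of the host buffer, intended for one command queue. */
typedef struct {
    size_t elem_offset;  /* index into the host array, as in &fSourceA[elem_offset] */
    size_t byte_offset;  /* byte offset into the device buffer */
    size_t byte_count;   /* number of bytes to transfer */
} HalfSlice;

/* Hypothetical helper (not from the SDK): describe half `queue_index`
   (0 or 1) of a buffer of `total_bytes` bytes whose elements are
   `elem_size` bytes each. */
HalfSlice half_slice(size_t total_bytes, size_t elem_size, int queue_index)
{
    HalfSlice s;

    assert(elem_size > 0);
    assert(queue_index == 0 || queue_index == 1);
    /* Assumption: the buffer splits evenly into two halves of whole elements. */
    assert(total_bytes % (2 * elem_size) == 0);

    s.byte_count  = total_bytes / 2;
    s.byte_offset = (size_t)queue_index * s.byte_count;
    s.elem_offset = s.byte_offset / elem_size;
    return s;
}
```

Under these assumptions, the write shown above matches `half_slice(szBuffBytes, sizeof(cl_float), 0)`: device offset 0, size szHalfBuffer, host pointer &fSourceA[0].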
I have two questions:
1) Is pinned memory actually required for the overlap to occur? Couldn't fSourceA simply be an ordinary host pointer, for example:
fSourceA = (cl_float *)malloc(szBuffBytes);
...
//write random data in fSourceA
2) cmPinnedSrcA is never passed to the kernel; cmDevSrcA is used instead. Doesn't this still increase the memory occupied on the device, since the space required for cmPinnedSrcA is added to the space required for cmDevSrcA?
Thank you