The call to clEnqueueReadBuffer() is most likely performing a memcpy-like operation, which is costly and non-optimal.
If you want to avoid this, allocate your data on the host normally (e.g. with malloc() or new) and pass the CL_MEM_USE_HOST_PTR flag, together with the pointer to that memory, to clCreateBuffer() when you create the OpenCL buffer object. Then use clEnqueueMapBuffer() and clEnqueueUnmapMemObject() to pass the data to and from OpenCL without actually performing a copy. This should be close to optimal with respect to minimizing OpenCL overhead.
For a higher-level abstraction implementing this technique, take a look at the mapped_view<T> class in Boost.Compute (the library's examples include a simple one).