OpenCL as a general-purpose C runtime compiler (Why is a kernel on the CPU slower than direct C?)

StackOverflow https://stackoverflow.com/questions/23146579


Question

I have been playing with OpenCL as a general-purpose means of executing run-time C code. While I'm interested in eventually getting code to run on the GPU, I am currently looking at the overhead of OpenCL on the CPU compared to running straight compiled C.

Obviously there is overhead in the preparation and compilation of the kernel in OpenCL. But even when I time just the final execution and buffer read:

clEnqueueNDRangeKernel(...);
clFinish(...);
clEnqueueReadBuffer(...);

For a simple calculation, the overhead is significant compared to the straight C code (a factor of 30), even with 1000 loops giving it an opportunity to parallelize. Obviously there is overhead in reading a buffer back from the OpenCL code, but the buffer is sitting on the same CPU, so the overhead can't be that large.

Does this sound right?


Solution

The call to clEnqueueReadBuffer() is most likely performing a memcpy-like operation, which is costly and non-optimal.

If you want to avoid this copy, allocate your data on the host normally (e.g. with malloc() or new) and pass the CL_MEM_USE_HOST_PTR flag to clCreateBuffer() when you create the OpenCL buffer object. Then use clEnqueueMapBuffer() and clEnqueueUnmapMemObject() to pass the data to/from OpenCL without actually performing a copy. This should be close to optimal with respect to minimizing OpenCL overhead.
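As a rough sketch of that pattern (the context, queue, kernel, and buffer size here are placeholder assumptions, not from the question, and error checking is omitted):

#include <vector>
#include <CL/cl.h>

// Sketch only: `context`, `queue`, and `kernel` are assumed to exist already,
// and the element count is a made-up placeholder.
void run_zero_copy(cl_context context, cl_command_queue queue, cl_kernel kernel)
{
    const size_t n = 1024;            // hypothetical buffer size
    std::vector<float> host_data(n);  // ordinary host allocation

    cl_int err = CL_SUCCESS;

    // Wrap the existing host memory instead of allocating a separate
    // device-side copy.
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                n * sizeof(float), host_data.data(), &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    size_t global_size = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                           NULL, 0, NULL, NULL);

    // Map instead of clEnqueueReadBuffer(): on a CPU device the runtime can
    // hand back the original pointer rather than copying. (Some
    // implementations also require the host pointer to be suitably aligned,
    // e.g. page-aligned, for a true zero-copy.)
    float *results = static_cast<float *>(
        clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                           0, n * sizeof(float), 0, NULL, NULL, &err));

    // ... read the results through `results` ...

    clEnqueueUnmapMemObject(queue, buf, results, 0, NULL, NULL);
    clReleaseMemObject(buf);
}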

For a higher-level abstraction implementing this technique, take a look at the mapped_view<T> class in Boost.Compute; a sketch of its use follows.
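This is only an illustrative sketch based on the library's documented mapped_view interface; the device setup and the sort call are stand-ins for whatever work you would actually run over the data:

#include <boost/compute/algorithm/sort.hpp>
#include <boost/compute/container/mapped_view.hpp>
#include <boost/compute/core.hpp>

namespace compute = boost::compute;

int main()
{
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    // Data lives in ordinary host memory.
    float data[] = { 5, 1, 9, 3, 4, 0, 2, 8, 7, 6 };

    // mapped_view wraps the host allocation without copying it.
    compute::mapped_view<float> view(data, 10, context);

    // Run a device algorithm directly over the view.
    compute::sort(view.begin(), view.end(), queue);

    // Map/unmap to make the results visible on the host, mirroring the
    // clEnqueueMapBuffer()/clEnqueueUnmapMemObject() calls above.
    view.map(queue);
    // ... the sorted values are now readable through `data` ...
    view.unmap(queue);

    return 0;
}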

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow