ClCreateBuffer(| CL_MEM_USE_HOST_PTR): When does OpenCL framework transfer data to device via PCI?

Question

You need to use the manufacturer supplied tools (I think vtune amplifier did the job on Intel hardware) to see what actually happens in the device, as the OpenCL spec intentionally allows the implementation leeway on when to actually perform things.

So I can only give you the points on when the device is permitted to do work and when it's actually forced to do it.

Right after you call

ClCreateBuffer(hostPtr, flag, ...)

The device is allowed to begin reading the data. It can do this while your program runs normally as you are not permitted to write there until you call EnqueueMapBuffer. It's extremely likely that your call to EnqueueNDRangeKernel comes before the transfer is complete so it just hangs around in the command queue.

All these lines and the device is only permitted to work, nothing has yet forced it to work so in some cases it might not have actually done anything yet. But then comes the call that forces it to evaluate everything/wait for the calls to finish, assuming that you set it as a blocking call.

ClEnqueueMapBuffer(hostPtr, ...)

If you run this call with blocking_map as true you actually will get the ready made data back as of this moment. The implementation makes you wait inside that call until the data is in device, is processed by the kernel and then written back.

If you don't run this as a blocking map then the data is not necessarily back yet. So you have just issued 3 non blocking calls and the device can do whatever it wishes.

tl;dr: Everything from write, execution to read can happen inside the blocking clEnqueueMapBuffer call.