Measuring and minimization of OpenCL overhead

Question

How do you measure the OCL time? Using only something like:

my_event.profile.end - my_event.profile.start

If it’s the case you can also take another metric like that:

my_event.profile.start - my_event.profile.queued

This metric measure the time spent in the user application as well as in the runtime before execution hence the overhead. This metric is suggested in the AMD programing guide in section 4.4.1.
They also give a warning about profiling explaining that commands can be sent by batch and therefore

Commands submitted as batch report similar start times and the same end time.

If I well recall, NVIDIA streams commands. But in any case you can use that to reduce the overhead. For instance instead of having:

Cl_prog.kernel1(…).wait()
Cl_prog.kernel2(…).wait()

You could do something like:

Event1 =   Cl_prog.kernel1(…)
Event2 = Cl_prog.kernel2(…)
Event1.wait()
Event2.wait()

And so on.
But I digress; now to answer specifically to your questions here are some input taken from the same section I mentioned above (It's from AMD but I guess it should be pretty much the same for NVIDIA):

"For CPU devices, the kernel launch time is fast (tens of µs), but for discrete GPU devices it can be several hundreds µs"
See quote above
"Enabling profiling on a command queue adds approximately 10 μs to 40 μs overhead to all clEnqueue calls".