Question

Consider the following code, which creates a buffer memory object from an array of doubles of length size:

coef_mem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, (sizeof(double) * size), arr, &err);

Suppose it is passed as an argument to a kernel. There are two possibilities, depending on the device on which the kernel runs:

  1. The device is the same as the host device
  2. The device is other than the host device

Here are my questions for both the possibilities:

  • At what step is the memory transferred to the device from the host?
  • How do I measure the time required for transferring the memory from host to device?
  • How do I measure the time required for transferring the memory from device's global memory to private memory?
  • Is the memory still transferred if the device is the same as the host device?
  • Will the time required to transfer from host to device be greater than the time required for transferring from device's global memory to private memory?

Solution

At what step is the memory transferred to the device from the host?

The only guarantee you have is that the data will be on the device by the time the kernel begins execution. The OpenCL specification deliberately doesn't mandate when these data transfers should happen, in order to allow different OpenCL implementations to make decisions that are suitable for their own hardware. If you only have a single device in the context, the transfer could be performed as soon as you create the buffer. In my experience, these transfers usually happen when the kernel is enqueued (or soon after), because that is when the implementation knows that it really needs the buffer on a particular device. But it really is completely up to the implementation.


How do I measure the time required for transferring the memory from host to device?

Use a profiler; most will show when these transfers happen and how long they take. Alternatively, if you transfer the data explicitly with clEnqueueWriteBuffer instead of relying on CL_MEM_COPY_HOST_PTR, you can use OpenCL's event profiling mechanism to time the transfer yourself.
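As a rough sketch of the event-profiling approach: the command queue must be created with CL_QUEUE_PROFILING_ENABLE, and the timestamps returned by clGetEventProfilingInfo are in nanoseconds. The function name time_write_ms is made up for illustration, the surrounding context and device are assumed to exist, and error checking is abbreviated (on OpenCL 2.0+ you would use clCreateCommandQueueWithProperties instead of the deprecated clCreateCommandQueue):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Hypothetical helper: time a host-to-device write in milliseconds. */
double time_write_ms(cl_context context, cl_device_id device,
                     const double *arr, size_t size)
{
    cl_int err;
    /* Profiling only works on queues created with this flag. */
    cl_command_queue queue = clCreateCommandQueue(
        context, device, CL_QUEUE_PROFILING_ENABLE, &err);

    /* No CL_MEM_COPY_HOST_PTR here: we transfer explicitly below. */
    cl_mem coef_mem = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                     sizeof(double) * size, NULL, &err);

    cl_event evt;
    err = clEnqueueWriteBuffer(queue, coef_mem, CL_TRUE, 0,
                               sizeof(double) * size, arr,
                               0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start, end;  /* device timestamps in nanoseconds */
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    clReleaseEvent(evt);
    clReleaseMemObject(coef_mem);
    clReleaseCommandQueue(queue);
    return (double)(end - start) * 1e-6;  /* ns -> ms */
}
```

Dividing the buffer size by the measured time gives you the achieved host-to-device bandwidth, which you can compare against what the profiler reports.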


How do I measure the time required for transferring the memory from device's global memory to private memory?

Again, use a profiler. Most profilers have a metric for the achieved bandwidth when reading from global memory, or something similar. Note that it isn't really an explicit transfer from global to private memory, though: loads happen implicitly as the kernel reads the data.


Is the memory still transferred if the device is the same as the host device?

With CL_MEM_COPY_HOST_PTR, yes. If you don't want a transfer to happen, use CL_MEM_USE_HOST_PTR instead. With unified memory architectures (e.g. integrated GPU), the typical recommendation is to use CL_MEM_ALLOC_HOST_PTR to allocate a device buffer in host-accessible memory (usually pinned), and access it with clEnqueueMapBuffer.
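A minimal sketch of that map-based pattern, assuming context, queue, a host array arr, and its element count size already exist (the helper name create_coef_buffer is made up, and error checking is omitted):

```c
#include <string.h>
#include <CL/cl.h>

/* Hypothetical helper: allocate a buffer in host-accessible memory
 * and fill it via mapping, avoiding an explicit copy on
 * unified-memory devices. */
cl_mem create_coef_buffer(cl_context context, cl_command_queue queue,
                          const double *arr, size_t size)
{
    cl_int err;
    cl_mem coef_mem = clCreateBuffer(
        context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
        sizeof(double) * size, NULL, &err);

    /* Map the buffer into the host address space and write in place. */
    double *ptr = (double *)clEnqueueMapBuffer(
        queue, coef_mem, CL_TRUE, CL_MAP_WRITE, 0,
        sizeof(double) * size, 0, NULL, NULL, &err);
    memcpy(ptr, arr, sizeof(double) * size);
    clEnqueueUnmapMemObject(queue, coef_mem, ptr, 0, NULL, NULL);

    return coef_mem;  /* set as a kernel argument as before */
}
```

On a discrete GPU the same code still works, but the implementation may perform a copy behind the scenes when the buffer is unmapped.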


Will the time required to transfer from host to device be greater than the time required for transferring from device's global memory to private memory?

Probably, but this depends on the architecture, on whether you have a unified memory system, and on how you actually access the data in the kernel (memory access patterns and caching will have a big effect).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow