Kernels are launched differently in OpenCL and CUDA, so your two snippets are actually launching different amounts of work.
In OpenCL, you specify the global work size (the total number of work-items to launch) and the local work size (the work-group size). In your example, you are launching 10000x10000 work-items in work-groups of 8x8.
In CUDA, you specify the block size (analogous to the work-group size) and the grid size, which is the number of *blocks* to launch, not threads. This means that your CUDA example is launching 10000x10000 blocks of 8x8 threads each, i.e. (10000*8)x(10000*8) = 80000x80000 CUDA threads in total.
So, this CUDA kernel launch:
kernel_empty<<<dim3(10000, 10000, 1), dim3(8, 8, 1)>>>();
is equivalent to this OpenCL kernel enqueue:
size_t global_work_size[3] = {80000, 80000, 1};
size_t local_work_size[3] = {8, 8, 1};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &perf_event);