Kernels are launched differently in OpenCL and CUDA, so your two snippets are actually launching different amounts of work.
In OpenCL, you specify the global work size (the total number of work-items to launch) and the local work size (the work-group size). In your example, you are launching 10000x10000 work-items in work-groups of 8x8.
In CUDA, you specify the block size (analogous to the work-group size) and the grid size, which is the number of *blocks* to launch, not threads. This means that your CUDA example is launching 10000x10000 blocks of 8x8 threads each, i.e. (10000*8)x(10000*8) = 80000x80000 CUDA threads in total.
So, this CUDA kernel launch:
kernel_empty<<<dim3(10000, 10000, 1), dim3(8, 8, 1)>>>();
is equivalent to this OpenCL kernel enqueue:
size_t global_work_size[3] = {80000, 80000, 1};
size_t local_work_size[3] = {8, 8, 1};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global_work_size, local_work_size, 0, NULL, &perf_event);