How does a barrier work in an OpenCL kernel?

https://stackoverflow.com/questions/22937047

opencl

29-06-2023

Question

Kernel code:

#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void calculate (__global double* in)
{
    int idx = get_global_id(0); // statement 1
    printf("started for %d workitem\n", idx); // statement 2
    in[idx] = idx + 100; // statement 3
    printf("value changed to %lf in %d workitem\n", in[idx], idx); // statement 4
    barrier(CLK_GLOBAL_MEM_FENCE); // statement 5
    printf("completed for %d workitem\n", idx); // statement 6
}

I call the kernel with clEnqueueNDRangeKernel, passing as its argument an array of 5 doubles, each initialized to 0.0.

I call the kernel with a global_work_size of 5, so each work-item handles one element of the array.

As per my theoretical understanding of barriers: to synchronize work-items in a work-group, OpenCL provides the barrier function. This forces a work-item to wait until every other work-item in the group reaches the barrier. By creating a barrier, you can make sure that every work-item has reached the same point in its processing. This is a crucial concern when the work-items need to finish computing an intermediate result that will be used in future computation.

Hence, I was expecting output like:

started for 0 workitem
started for 1 workitem
value changed to 100.000000 in 0 workitem
value changed to 101.000000 in 1 workitem
started for 3 workitem
value changed to 103.000000 in 3 workitem
started for 2 workitem
value changed to 102.000000 in 2 workitem
started for 4 workitem
value changed to 104.000000 in 4 workitem

completed for 3 workitem
completed for 0 workitem
completed for 1 workitem
completed for 2 workitem
completed for 4 workitem

These "completed" statements should come together at the end, because the barrier holds each work-item back until all the others have reached that point.

But the result I am getting is:

started for 0 workitem
value changed to 100.000000 in 0 workitem
completed for 0 workitem
started for 4 workitem
value changed to 104.000000 in 4 workitem
completed for 4 workitem
started for 1 workitem
started for 2 workitem
started for 3 workitem
value changed to 101.000000 in 1 workitem
value changed to 103.000000 in 3 workitem
completed for 3 workitem
value changed to 102.000000 in 2 workitem
completed for 2 workitem
completed for 1 workitem

Am I missing something in the logic? How does a barrier work in an OpenCL kernel?

I added more checks to the kernel, cross-checking the updated values after the barrier instead of relying on print statements.

#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void calculate (__global double* in)
{
    int idx = get_global_id(0);
    in[idx] = idx + 100;
    barrier(CLK_GLOBAL_MEM_FENCE);
    if (idx == 0) {
        in[0] = in[4];
        in[1] = in[3];
        in[2] = in[2];
        in[3] = in[1];
        in[4] = in[0];
    }
}

After that, the array should be:

after arr[0] = 104.000000
after arr[1] = 103.000000
after arr[2] = 102.000000
after arr[3] = 101.000000
after arr[4] = 100.000000

But the results I am getting are:

after arr[0] = 0.000000
after arr[1] = 101.000000
after arr[2] = 102.000000
after arr[3] = 103.000000
after arr[4] = 104.000000
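A side note on the expected values: even if all five writes are visible before the `idx == 0` branch runs, the five assignments would not produce the reversed array, because `in[0]` and `in[1]` are overwritten before they are read again. The plain C sketch below replays those assignments on the host (no OpenCL involved) next to a swap-based reversal. The helper names `copy_like_kernel` and `reverse_in_place` are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Replays the five assignments from the idx == 0 branch, in order,
 * as a single work-item would execute them. */
static void copy_like_kernel(double *in)
{
    in[0] = in[4];
    in[1] = in[3];
    in[2] = in[2];
    in[3] = in[1];  /* in[1] already holds the old in[3] */
    in[4] = in[0];  /* in[0] already holds the old in[4] */
}

/* Reverses n elements in place by swapping through a temporary,
 * so no value is overwritten before it has been read. */
static void reverse_in_place(double *in, size_t n)
{
    assert(in != NULL);
    for (size_t i = 0; i < n / 2; ++i) {
        double tmp = in[i];
        in[i] = in[n - 1 - i];
        in[n - 1 - i] = tmp;
    }
}
```

Starting from 100, 101, 102, 103, 104, `copy_like_kernel` leaves 104, 103, 102, 103, 104, while `reverse_in_place` gives 104, 103, 102, 101, 100.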

Solution

The code looks fine; I suspect the local work-group size. If you have not specified a local work-group size (that is, you pass NULL for local_work_size), the OpenCL implementation chooses one for you, and it may well choose 1.

Compare your clEnqueueNDRangeKernel call with the call below:

size_t global_item_size = 5; // total number of work-items
size_t local_item_size = 5;  // number of work-items per work-group
clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

NOTE: This answer assumes that you have either not specified the local work-group size or not set it as your use case requires.

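A little more on work-groups:

A barrier blocks only the work-items of a single work-group; it does not synchronize different work-groups. If you have not specified the work-group size and it ends up being 1, you have 5 work-groups with one work-item each, so each work-item passes its barrier immediately without waiting for the other four. That is consistent with your second test: work-item 0 can read in[4] before work-item 4 has written it.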
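To make the grouping concrete, here is a minimal sketch in plain C (no OpenCL runtime needed) of how a 1-D NDRange is split into work-groups. It assumes a zero global offset and a local size that divides the global size evenly, as OpenCL 1.x requires. The `work_item_pos` struct and the two helper functions are hypothetical names for illustration, not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: where a work-item lands in a 1-D NDRange. */
typedef struct {
    size_t group_id;  /* what get_group_id(0) would report */
    size_t local_id;  /* what get_local_id(0) would report */
} work_item_pos;

/* Number of work-groups for a given global and local size.
 * A barrier only synchronizes work-items inside one of these groups. */
static size_t num_work_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Maps a global id to its work-group and its position inside that group,
 * assuming a zero global offset. */
static work_item_pos locate_work_item(size_t global_id, size_t local_size)
{
    work_item_pos pos;
    assert(local_size > 0);
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}
```

With a global size of 5, `num_work_groups(5, 1)` is 5 (every work-item sits alone in its group, so the barrier has nobody to wait for), while `num_work_groups(5, 5)` is 1 (all five work-items share one group and meet at the barrier).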

OTHER TIPS

Yes, you are missing the fact that adding printf() makes any conclusion about execution order invalid.

In fact, OpenCL leaves parts of printf() implementation-defined and states that, in the case that printf is executed from multiple work-items concurrently, there is no guarantee of ordering with respect to written data. A plausible implementation keeps one buffer per work-item and flushes the buffers one work-item at a time, since that is the easiest way to serialize the output after a parallel execution has filled many buffers.

The work-items may well be executing in the order you expect, but stdout is flushed after the kernel has already finished, and the flush does not follow the original order.
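The buffering effect can be sketched with a toy model in plain C. This is only an assumption about one possible scheme (one private buffer per work-item, flushed one work-item at a time after the run), not a description of any particular driver. All names here (`item_buffer`, `buffered_printf`, `flush_all`, `simulate`) are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NUM_ITEMS 3   /* hypothetical number of work-items */
#define MAX_LINES 4   /* hypothetical per-work-item buffer capacity */
#define LINE_LEN 32

typedef struct {
    char lines[MAX_LINES][LINE_LEN];
    int count;
} item_buffer;

/* Appends one message to a work-item's private buffer. */
static void buffered_printf(item_buffer *buf, const char *msg)
{
    assert(buf->count < MAX_LINES);
    strncpy(buf->lines[buf->count], msg, LINE_LEN - 1);
    buf->lines[buf->count][LINE_LEN - 1] = '\0';
    buf->count++;
}

/* Flushes the buffers one work-item at a time into out. */
static void flush_all(const item_buffer *bufs, int n, char *out, size_t out_size)
{
    out[0] = '\0';
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < bufs[i].count; ++j) {
            strncat(out, bufs[i].lines[j], out_size - strlen(out) - 1);
            strncat(out, "\n", out_size - strlen(out) - 1);
        }
    }
}

/* Every work-item logs "started" before any logs "completed",
 * as if a barrier sat between the two phases. */
static void simulate(char *out, size_t out_size)
{
    item_buffer bufs[NUM_ITEMS];
    char msg[LINE_LEN];

    memset(bufs, 0, sizeof bufs);
    for (int i = 0; i < NUM_ITEMS; ++i) {
        snprintf(msg, sizeof msg, "started %d", i);
        buffered_printf(&bufs[i], msg);
    }
    for (int i = 0; i < NUM_ITEMS; ++i) {
        snprintf(msg, sizeof msg, "completed %d", i);
        buffered_printf(&bufs[i], msg);
    }
    flush_all(bufs, NUM_ITEMS, out, out_size);
}
```

In this model the messages are produced phase by phase, yet the flushed text groups them per work-item ("started 0", "completed 0", "started 1", and so on), so the printed order says little about when each statement actually ran.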

Licensed under: CC-BY-SA with attribution
