Question

I am just getting started with OpenCL and still learning.

Kernel Code:

__kernel void gpu_kernel(__global float* data)
{
    printf("workitem %d invoked\n", (int)get_global_id(0));
    int x = 0;
    if (get_global_id(0) == 1) {
        // work-item 1 spins here forever
        while (x < 1) {
            x = 0;
        }
    }
    printf("workitem %d completed\n", (int)get_global_id(0));
}

C host code for invoking the kernel:

size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 1; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

Output:

workitem 3 invoked
workitem 3 completed
workitem 0 invoked
workitem 0 completed
workitem 1 invoked
workitem 2 invoked
workitem 2 completed

## Here the program hangs on the terminal, waiting for work-item #1 to finish, which will never happen

This suggests that the work-items run in parallel (but each in a different work-group).

Another C host snippet for invoking the kernel (1 work-group with 4 work-items):

size_t global_item_size = 4; // number of workitems total
size_t local_item_size = 4; // number of workitems per group
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size, &local_item_size, 0, NULL, NULL);

Output:

workitem 0 invoked
workitem 0 completed
workitem 1 invoked
## Here the program hangs on the terminal, waiting for work-item #1 to finish, which will never happen

This suggests the work-items run in sequence: work-item 0 completed, execution then got stuck on work-item 1, and the rest were never started.

My Question:

I need to launch 1 work-group with 4 work-items that run in parallel, so that I can use a barrier in my code (which, as I understand it, is only possible within a single work-group). How can I do that?

Any help/suggestion/pointer will be appreciated.


Solution

Your second host code snippet correctly launches a single work-group that contains 4 work-items. You have no guarantees that these work-items will run in parallel, since the hardware might not have the resources to do so. However, they will run concurrently, which is exactly what you need in order to be able to use work-group synchronisation constructs such as barriers. See this Stack Overflow question for a concise description of the difference between parallelism and concurrency. Essentially, the work-items in a work-group will make forward progress independently of each other, even if they aren't actually executing in parallel.

OpenCL 1.2 Specification (Section 3.2: Execution Model)

The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.

Based on your previous question on a similar topic, I assume you are using AMD's OpenCL implementation targeting the CPU. The way most OpenCL CPU implementations work is by serialising all work-items from a work-group into a single thread. This thread then executes each work-item in turn (ignoring vectorisation for the sake of argument), switching between them when they either finish or hit a barrier. This is how they achieve concurrent execution, and gives you all the guarantees you need in order to safely use barriers within your kernel. Parallel execution is achieved by having multiple work-groups (as in your first example), which will result in multiple threads executing on multiple cores (if available).
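For illustration, here is a minimal sketch of the kind of work-group cooperation this execution model makes safe: each work-item writes its ID into local memory, all work-items synchronise at a barrier, and each then reads its neighbour's slot. The kernel name neighbour_exchange, the out buffer, and the fixed local size of 4 are assumptions for this example, not part of your code.

__kernel void neighbour_exchange(__global int* out)
{
    __local int scratch[4];               // sized for a local size of 4 (assumed)

    int lid   = (int)get_local_id(0);
    int lsize = (int)get_local_size(0);

    scratch[lid] = lid;                   // each work-item fills its own slot

    // No work-item may pass this point until every work-item in the
    // work-group has executed the write above.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Reading a neighbour's slot is safe only because of the barrier.
    out[get_global_id(0)] = scratch[(lid + 1) % lsize];
}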

If you replaced your infinite loop with a barrier, you would clearly see that this does actually work.
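As a rough sketch of that change (same kernel, with the infinite loop swapped for a barrier; the unused data argument is kept so your host code stays the same), you would expect all four "invoked" lines to appear before any "completed" line:

__kernel void gpu_kernel(__global float* data)
{
    printf("workitem %d invoked\n", (int)get_global_id(0));

    // Every work-item in the work-group must reach this barrier before
    // any of them is allowed to continue past it.
    barrier(CLK_LOCAL_MEM_FENCE);

    printf("workitem %d completed\n", (int)get_global_id(0));
}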

Licensed under: CC-BY-SA with attribution