OpenCL - reusing global memory

https://stackoverflow.com/questions/22291785

12-06-2023
|

Question

This is a seemingly basic problem I haven't been able to right with a fair amount of trial and error. I have a kernel which makes use of two global r/w buffers and one local - it takes input from the first buffer, does a pseudo-sort on it using the second buffer for interim storage, and ultimately copies it back to the first in a certain order. (Stripped) code is as follows:

struct PACKET_POINTER {
       int packetIndex;
       int currentCell;
};

#define RPC_DIV_BUCKET 100
__kernel void PseudoSort(__global struct PACKET_POINTER * in,__global struct PACKET_POINTER * out, __local struct PACKET_POINTER * aux) {
  int i = get_local_id(0);
  int wg = get_local_size(0);
  int gid = get_global_id(0);
  int offset = get_group_id(0) * wg;

  aux[i] = in[i+offset];
  barrier(CLK_LOCAL_MEM_FENCE);
  //-----
  //Irrelevant code block here
  //----- 
  out[(gid%1024)*RPC_DIV_BUCKET + (gid/1024)] = aux[i];
}

Retrieving the contents of the "out" buffer in the parent C program happens without issue. However, when I add the following lines to the kernel:

    barrier(CLK_GLOBAL_MEM_FENCE);
    in[gid] = out[gid];

and attempt to read the "in" buffer, it turns up mostly garbage values on first execution, but will have the expected data if the .exe is run a second time without modification. I have a clFinish(commands) call between the kernel call and buffer read, so it should be running to completion before any read attempts. Something obvious I'm missing here? Appreciate the help in advance - will post a solution if I happen upon it before then.

Solution

CLK_GLOBAL_MEM_FENCE only syncs within a workgroup. There is no way to place a barrier that would sync across all workgroups (e.g it only syncs across those threads which have identical group_id).

You have a race condition there. As an example when global_id is 1 a write goes into out[100]. Then that particular thread reads from out[1] and writes to in[1]. However out[1] is written only at global_id 1024. Which is almost certainly in a different workgroup. So you will read garbage as the first workgroup is going to finish before the out[1] is ever going to get written.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow