A barrier with CLK_GLOBAL_MEM_FENCE only syncs within a workgroup. There is no way to place a barrier that would sync across all workgroups (i.e. it only syncs across those threads which have an identical group_id).
You have a race condition there. As an example, when global_id is 1, a write goes into out[100]. That same thread then reads from out[1] and writes to in[1]. However, out[1] is written only by the thread with global_id 1024, which is almost certainly in a different workgroup. So you will read garbage, since the first workgroup can finish before out[1] is ever written.
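To make the race concrete, here is a small sketch in plain C that simulates it on the host. It is not your kernel. It assumes a transpose-style index mapping that happens to reproduce the numbers above (id 1 writes out[100], id 1024 writes out[1]), and the sizes, the workgroup size of 256, the buffer names and the helper functions are all hypothetical. It also picks just one possible schedule, in which workgroups run to completion one after another; a real device may interleave them in any order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes, chosen so that id 1 -> out[100] and id 1024 -> out[1]. */
enum { WIDTH = 1024, HEIGHT = 100, N = WIDTH * HEIGHT, GROUP_SIZE = 256 };

#define UNWRITTEN (-1) /* sentinel standing in for "garbage" in out */

static int in_buf[N];
static int out_buf[N];

/* Assumed transpose-style mapping from a work-item id to its slot in out. */
static size_t target_index(size_t gid)
{
    assert(gid < N);
    return (gid % WIDTH) * HEIGHT + gid / WIDTH;
}

static void reset_buffers(void)
{
    for (size_t i = 0; i < N; ++i) {
        in_buf[i] = (int)i;
        out_buf[i] = UNWRITTEN;
    }
}

/* Counts the elements of in that were copied from a not-yet-written out slot. */
static size_t count_stale(void)
{
    size_t stale = 0;
    for (size_t i = 0; i < N; ++i)
        if (in_buf[i] == UNWRITTEN)
            ++stale;
    return stale;
}

/* One kernel with a barrier: each workgroup does both phases before the next starts. */
static size_t run_single_kernel(void)
{
    reset_buffers();
    for (size_t first = 0; first < N; first += GROUP_SIZE) {
        /* Phase 1: every work-item of this group scatters into out. */
        for (size_t gid = first; gid < first + GROUP_SIZE; ++gid)
            out_buf[target_index(gid)] = in_buf[gid];
        /* barrier(CLK_GLOBAL_MEM_FENCE) would sit here; it orders this group only. */
        /* Phase 2: every work-item of this group reads out[gid] back into in. */
        for (size_t gid = first; gid < first + GROUP_SIZE; ++gid)
            in_buf[gid] = out_buf[gid];
    }
    return count_stale();
}

/* Two kernels: all writes to out complete before any work-item reads it. */
static size_t run_two_kernels(void)
{
    reset_buffers();
    for (size_t gid = 0; gid < N; ++gid)   /* kernel 1: scatter */
        out_buf[target_index(gid)] = in_buf[gid];
    for (size_t gid = 0; gid < N; ++gid)   /* kernel 2: copy back */
        in_buf[gid] = out_buf[gid];
    return count_stale();
}
```

Calling run_single_kernel() reports a nonzero number of stale reads, and in_buf[1] is one of them, because the group containing id 1024 has not run yet when the first group reads out[1]. run_two_kernels() reports none. In OpenCL terms the second variant corresponds to splitting the work into two kernels and enqueuing them one after the other on an in-order command queue, so the second kernel only starts once the first has finished.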