
I've been playing with OpenCL recently, and I'm able to write simple kernels that use only global memory. Now I'd like to start using local memory, but I can't seem to figure out how to use get_local_size() and get_local_id() to compute one "chunk" of output at a time.

For example, let's say I wanted to convert Apple's OpenCL Hello World example kernel to something that uses local memory. How would you do it? Here's the original kernel source:

__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    // One work item per element; skip work items past the end of the array.
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}

If this example can't easily be converted into something that shows how to make use of local memory, any other simple example will do.

There is another way to do this if the size of the local memory is constant. Instead of passing a pointer in the kernel's parameter list, you can declare the local buffer inside the kernel by marking it __local:

__local float localBuffer[1024];

This saves some host code, because fewer clSetKernelArg calls are needed.
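To tie this back to the question, here is a rough sketch of the square kernel staged through a fixed-size local buffer. The names square_local, tile and TILE_SIZE are made up for this example. It assumes the host launches the kernel with a work-group size no larger than TILE_SIZE and a global size rounded up to a multiple of the work-group size.

```c
// Assumption: work-group size <= TILE_SIZE (a name chosen for this sketch).
#define TILE_SIZE 256

__kernel void square_local(
    __global const float *input,
    __global float *output,
    const unsigned int count)
{
    // Fixed-size local buffer, shared by all work items of one work-group.
    // No clSetKernelArg call is needed for it on the host side.
    __local float tile[TILE_SIZE];

    const size_t gid = get_global_id(0);  // index into the whole array
    const size_t lid = get_local_id(0);   // index within this work-group

    // Each work item copies one element from global to local memory.
    if (gid < count)
        tile[lid] = input[gid];

    // Wait until the whole work-group has finished writing the tile.
    // The barrier sits outside the if so that every work item reaches it.
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gid < count)
        output[gid] = tile[lid] * tile[lid];
}
```

Every work item here only reads back the element it wrote itself, so this shows the mechanics (get_local_id(), the local buffer, the barrier) but is unlikely to run faster than the original kernel.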

In OpenCL, local memory is meant for sharing data among all work items in a work-group. It usually requires a barrier call before the data can be used (for example, when one work item wants to read local memory data written by other work items). Barriers are costly in hardware. Keep in mind that local memory pays off only for data that is read or written repeatedly, and that bank conflicts should be avoided as much as possible.
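As a sketch of a case where the data really is shared between work items, the kernel below sums the squares within each work-group. The names sum_of_squares, scratch, partial_sums and GROUP_SIZE are invented for this example. It assumes a power-of-two work-group size no larger than GROUP_SIZE, an output buffer with one slot per work-group, and a host that adds up the partial sums afterwards.

```c
// Assumption: work-group size is a power of two and <= GROUP_SIZE.
#define GROUP_SIZE 128

__kernel void sum_of_squares(
    __global const float *input,
    __global float *partial_sums,   // one result per work-group
    const unsigned int count)
{
    __local float scratch[GROUP_SIZE];

    const size_t gid = get_global_id(0);
    const size_t lid = get_local_id(0);

    // Stage one squared element per work item; out-of-range items add 0.
    const float x = (gid < count) ? input[gid] : 0.0f;
    scratch[lid] = x * x;

    // Other work items will read this value, so the barrier is required.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: halve the number of active work items in each step.
    for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        // All work items reach this barrier, active or not.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work item 0 writes the total for this work-group.
    if (lid == 0)
        partial_sums[get_group_id(0)] = scratch[0];
}
```

Each value in scratch is read and written several times, which is the kind of reuse that can justify the barrier cost. In every step, adjacent work items touch adjacent local memory words, an access pattern that usually avoids bank conflicts on typical GPU hardware.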

If you are not careful with local memory, you may sometimes end up with worse performance than with global memory alone.

