Question

I tried to implement a cumulative sum in OpenCL as follows:

__kernel void cumsum(__global float *a)
{
    int gid = get_global_id(0);
    int n = get_global_size(0);

    for (int i = 1; i < n; i <<= 1)
        if (gid & i)
            a[gid] += a[(gid & -i) - 1];
}

I called this kernel from PyOpenCL:

import pyopencl as cl
import pyopencl.array as cl_array
import numpy as np

a = np.random.rand(50000).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl_array.to_device(queue, a)

with open("imm/cluster.cl", 'r') as f:
    prg = cl.Program(ctx, f.read()).build()

prg.cumsum(queue, a.shape, None, a_dev.data)
print(np.cumsum(a)[:33], a_dev[:33])

However, only the first 32 values are correct; every value after that is wrong (too low). Does this have something to do with the work-group sizes? How do I fix this?


Solution

Once i becomes large enough, a[(gid & -i) - 1] is an element owned by a different work-group, so you are reading that work-group's output. Nothing in the OpenCL execution model guarantees that the other work-group has finished executing at that point.

In general the other work-group will not have finished, so you read a partial sum and end up with values that are lower than expected.
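
One possible fix, sketched here under some assumptions: work-groups can only be synchronized between kernel launches, so move the loop over i to the host and launch the kernel once per stride. Within a single stride the elements being read have bit i clear and the elements being written have bit i set, so no launch reads a value that the same launch writes. The names STEP_KERNEL, cumsum_step, cumsum_by_steps and cumsum_on_device are hypothetical and not from the question. cumsum_by_steps is only a NumPy model of the launch sequence; cumsum_on_device assumes an in-order command queue (the PyOpenCL default) and is untuned.

```python
import numpy as np

# Hypothetical per-step kernel: the stride i is passed in by the host
# instead of being looped over inside the kernel.
STEP_KERNEL = """
__kernel void cumsum_step(__global float *a, const int i)
{
    int gid = get_global_id(0);
    if (gid & i)
        a[gid] += a[(gid & -i) - 1];
}
"""


def cumsum_by_steps(a):
    """NumPy model of the scheme: one 'launch' per stride i."""
    out = np.array(a, dtype=np.float64)
    gid = np.arange(out.size)
    i = 1
    while i < out.size:
        active = gid[(gid & i) != 0]   # work-items that add in this step
        src = (active & -i) - 1        # last element of the preceding block
        out[active] += out[src]        # sources are read before any write
        i <<= 1
    return out


def cumsum_on_device(a):
    """Same scheme on an OpenCL device; a is a 1-D float32 array."""
    import pyopencl as cl              # imported here so the model above
    import pyopencl.array as cl_array  # runs without an OpenCL install

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)       # in-order: launches run one by one
    a_dev = cl_array.to_device(queue, a)
    prg = cl.Program(ctx, STEP_KERNEL).build()

    i = 1
    while i < a.size:
        prg.cumsum_step(queue, a.shape, None, a_dev.data, np.int32(i))
        i <<= 1
    return a_dev.get()
```

This needs one launch per power of two below n (16 launches for n = 50000). A scan inside each work-group followed by a second pass that adds per-group offsets is a common way to reduce the number of launches.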

Licensed under: CC-BY-SA with attribution