Question

For a current OpenCL GPGPU project, I need to sort the elements in an array according to a key with 64 possible values. I need all elements with the same key to end up contiguous in the final array. It is sufficient to have an associative array new_index[old_index] as the output of this task.

I split the task into two parts. First, I count, for each possible key (bucket), the number of elements with that key (that go into that bucket). Then I scan this array (generate a prefix sum), which yields the new index range of the elements of each bucket, i.e. a "start" index per bucket.
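To make this concrete, here is a minimal host-side sketch of that first part, with hypothetical names (count[b] is the number of elements with key b, NUM_BUCKETS = 64):

    // exclusive prefix sum: bucket_start[b] = number of elements in buckets
    // 0..b-1, i.e. the first output index for bucket b
    uint sum = 0;
    for (int b = 0; b < NUM_BUCKETS; ++b) {
        bucket_start[b] = sum;
        sum += count[b];
    }
    // e.g. count = {3, 1, 4, ...}  ->  bucket_start = {0, 3, 4, 8, ...}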

The second step then has to assign each element a new index. If I were to implement this on a CPU, the algorithm would be something like this:

for all elements e:
    new_index[e] = bucket_start[bucket(e)]++

Of course, this doesn't work on the GPU. Every item needs read-write access to the bucket_start array, which amounts to a synchronization between all work items, the worst thing we can do.

An idea is to put some of the computation into work groups. But I'm not sure how exactly this should be done, since I'm not experienced in GPGPU computing.

In global memory, we have the bucket start array initialized with the prefix sum as above. Access to this array is "mutexed" with an atomic int. (I'm new to this, so I may be mixing up some terms here.)

Every work group is implicitly assigned a part of the input element array. It uses a local bucket array containing the new indices, relative to the (global) bucket starts, which we do not know yet. Once one of these "local buffers" is full, the work group has to write it into the global array. For this, it locks access to the global bucket start array, increments those values by the current local bucket sizes, unlocks, and can then write the result into the global new_index array (by adding the corresponding offset). This process is repeated until all assigned elements are processed.

Two questions arise:

  1. Is this a good approach? I know that reading from and writing to global memory is most likely the bottleneck here, especially since I'm trying to acquire synchronized access to (at least only a small part of) the global memory. But maybe there is a much better approach, maybe using kernel decomposition. Note that I try to avoid reading data back from the GPU to the CPU between kernels (to avoid an OpenCL command queue flush, which is also bad, as I was taught).

  2. In the algorithm design above, how do I implement the locking mechanism? Will something like the following code work? In particular, I expect problems when the hardware executes work items "truly in parallel" in SIMD groups, like Nvidia "warps". In my current code, all items of a work group would try to acquire the lock in SIMD fashion. Should I restrict this to the first work item only, and use barriers to keep them locally in sync?

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
    
    __kernel void putInBuckets(__global uint *mutex,
                               __global uint *bucket_start,
                               __global uint *new_index)
    {
        __local uint bucket_size[NUM_BUCKETS];
        __local uint bucket[NUM_BUCKETS][LOCAL_MAX_BUCKET_SIZE]; // local "new_index"
        __local uint l_bucket_start[NUM_BUCKETS]; // __local variables belong at kernel scope, not inside the loop
    
        while (...)
        {
            // process a couple of elements locally until a local bucket is full
            ...
    
            // "lock"
            while(atomic_xchg(mutex, 1)) {
            }
    
            // "critical section"
            for (int b = 0; b < NUM_BUCKETS; ++b) {
                l_bucket_start[b] = bucket_start[b]; // where should we write?
                bucket_start[b] += bucket_size[b];   // update global offset
            }
    
            // "unlock"
            atomic_xchg(mutex, 0);
    
            // write to global memory by adding the offset
            for (...)
                new_index[...] = ... + l_bucket_start[b];
        }
    }
    


OTHER TIPS

First, never try to implement a locking algorithm on a GPU. It will deadlock and stall. This is because a GPU is a SIMD device and the threads don't execute independently as on a CPU. A GPU executes a set of threads, called a warp (Nvidia) or wavefront (AMD), synchronously. So if one thread in the wavefront is stalled, it stalls all other threads in the wavefront. If the unlocking thread is in a stalled wavefront, it will NOT execute and unlock the mutex.
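To make the hazard concrete, consider the spin lock from the question (an illustrative sketch, not working code):

    // All work items of one wavefront reach this loop together. At most one
    // of them acquires the mutex, but on SIMD hardware the branch diverges:
    // the "winner" can be masked off while its siblings keep spinning in the
    // same instruction stream, so the unlock is never reached.
    while (atomic_xchg(mutex, 1)) {
    }
    // ... critical section ...
    atomic_xchg(mutex, 0); // may never execute: the wavefront deadlocks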

Atomic operations are OK.

What you should consider is a lock-free approach. See this paper for an explanation and sample CUDA code: http://www.cse.iitk.ac.in/users/mainakc/pub/icpads2012.pdf/

It describes lock-free hash tables, linked lists, and skip lists, with some sample CUDA code.

A suggested approach is to create a two-level data structure.

The first level is a lock-free skip list. Each skip-list entry holds the second-level structure, a lock-free linked list for duplicate values, plus an atomic count of the number of entries.

The insert method:

1) Generate the bucket key (one of the 64 possible values).
2) Find the key in the skip list.
3) If not found, insert the key into the skip list.
4) Insert the data into the linked list (a rough sketch follows below).
5) Increment the atomic counter for this bucket.
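Since the base OpenCL atomics operate on 32-bit integers rather than pointers, step 4 can address nodes by index. A minimal sketch, with all names (NIL, head, next, push_node) hypothetical:

    #define NIL 0xFFFFFFFFu // marks an empty list

    // Lock-free prepend of node n to bucket b: head[b] is the index of the
    // first node of bucket b, next[n] is the index of node n's successor.
    void push_node(volatile __global uint *head, __global uint *next,
                   uint b, uint n)
    {
        uint old;
        do {
            old = head[b];
            next[n] = old; // link the new node in front of the current head
        } while (atomic_cmpxchg(&head[b], old, n) != old);
    }

Step 5 is then a plain atomic_inc on that bucket's counter.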

After insertion, prefix-sum all the counters of the skip-list buckets to find the offset of the output.

I found a much easier way to append the local buffers to the global arrays. It only requires two steps, one of which involves atomic operations.

The first step is to assign each thread the index in the global target array where it will write its elements. For this, we can use atomic_add on a __global counter to add the number of elements to be appended; in this concrete example, use it on bucket_start. The return value of atomic_add is the old value.

In the second step we use this return value as the base index for copying the local buffers into the target array. If we decide to use a whole work group for one such append operation, we distribute the copying of the local buffer to the global array within the work group "as usual". In the above bucket sort example, we copy multiple arrays, and when the number of arrays (= number of buckets) equals the work group size, we can instead assign each thread one bucket, which it copies in a loop. A sketch follows below.
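A minimal sketch in terms of the question's variables, assuming bucket[b][i] holds the original index of the i-th element this work group assigned to bucket b and bucket_size[b] holds the count (lid and l_bucket_start are hypothetical names):

    uint lid = get_local_id(0);
    __local uint l_bucket_start[NUM_BUCKETS];

    // Step 1: one work item per bucket reserves a contiguous range in the
    // global output. atomic_add returns the old value of bucket_start[lid],
    // which is exactly this group's base index for that bucket.
    if (lid < NUM_BUCKETS)
        l_bucket_start[lid] = atomic_add(&bucket_start[lid], bucket_size[lid]);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Step 2: copy the local buffers into the reserved ranges, one work item
    // per bucket (assumes the work group has at least NUM_BUCKETS items).
    if (lid < NUM_BUCKETS)
        for (uint i = 0; i < bucket_size[lid]; ++i)
            new_index[bucket[lid][i]] = l_bucket_start[lid] + i;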

I recently had to solve a similar problem, and I found a much more elegant and efficient solution. I thought I would share.

The general algorithm is as follows:

1. kernel 1: thread per element

  • Count the number of elements in each bucket (histogram).
  • For each element: calculate its offset from the start of its bucket (the tricky part).

2. kernel 2: thread per bucket

  • prefix sum (scan) on the histogram to calculate the start of each bucket

3. kernel 3: thread per element

  • scatter elements.

    for each element i in input: output[i] = prefix_sum[input[i]] + offsets[i];
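For kernel 2, there are only NUM_BUCKETS = 64 counters, so a single work group suffices. A possible sketch (not the author's code): a Hillis-Steele scan in local memory, launched with one work group of NUM_BUCKETS work items:

    __kernel void bucket_prefix_sum(__global uint *histogram,
                                    __global uint *prefix_sum_histogram)
    {
        __local uint tmp[NUM_BUCKETS];
        size_t lid = get_local_id(0);

        // seed with the histogram shifted right by one -> exclusive scan
        tmp[lid] = (lid == 0) ? 0 : histogram[lid - 1];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (uint stride = 1; stride < NUM_BUCKETS; stride *= 2) {
            uint v = (lid >= stride) ? tmp[lid - stride] : 0;
            barrier(CLK_LOCAL_MEM_FENCE); // all reads before any writes
            tmp[lid] += v;
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        prefix_sum_histogram[lid] = tmp[lid];
    }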

The tricky part is to generate the offsets array that we use in the 3rd kernel.

In the first kernel, we define a local cache that contains the per-workgroup bucket histogram. I use the fact that atomic_add returns the previous value of the counter, the 'current' offset. This fact is the key.

    // On OpenCL 1.0 the 32-bit base atomics are extensions; from 1.1 on they are core.
    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
    #pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable

    __kernel void bucket_histogram(__global uint *input,
                                   __global uint *histogram,
                                   __global uint *offsets)
    {
        __local uint local_histogram[NUM_BUCKETS];

        size_t local_idx = get_local_id(0);
        size_t global_idx = get_global_id(0);

        // zero local mem
        if (local_idx < NUM_BUCKETS)
        {
            local_histogram[local_idx] = 0;
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // increment the local histogram, save the local offset for later
        uint value = input[global_idx];
        uint local_offset = atomic_add(&local_histogram[value], 1);

        barrier(CLK_LOCAL_MEM_FENCE);

        // store the buckets in the global histogram (for the later prefix sum)
        if (local_idx < NUM_BUCKETS)
        {
            uint count = local_histogram[local_idx];
            if (count > 0)
            {
                // increment the global histogram, save the offset!
                uint group_offset = atomic_add(&histogram[local_idx], count);
                local_histogram[local_idx] = group_offset;
            }
        }

        barrier(CLK_LOCAL_MEM_FENCE);

        // local_histogram now changes roles: it contains this group's offset
        // from the start of each bucket
        offsets[global_idx] = local_offset + local_histogram[value];
    }

The second kernel performs a prefix sum to calculate the start of each bucket. The 3rd kernel simply combines all the offsets:

    __kernel void bucket_sort_scatter(__global uint *input,
                                      __global uint *prefix_sum_histogram,
                                      __global uint *offsets,
                                      __global data_t *output)
    {
        size_t global_idx = get_global_id(0);
        uint value = input[global_idx];
        uint scatter_target = offsets[global_idx] + prefix_sum_histogram[value];
        output[scatter_target] = value;
    }