Question

For a current OpenCL GPGPU project, I need to sort the elements in an array according to a key with 64 possible values. I need all elements with the same key to end up contiguous in the final array. It is sufficient to have an associative array new_index[old_index] as the output of this task.

I split the task into two parts. First, I count, for each possible key (bucket), the number of elements with that key (that go into that bucket). Then I scan this array (generate a prefix sum), which yields the new index range of the elements of each bucket, i.e. a "start" index per bucket.
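To make this concrete, here is a minimal host-side sketch of that first part, with hypothetical names (count[b] is the number of elements with key b, NUM_BUCKETS = 64):

    // exclusive prefix sum: bucket_start[b] = number of elements in buckets
    // 0..b-1, i.e. the first output index for bucket b
    uint sum = 0;
    for (int b = 0; b < NUM_BUCKETS; ++b) {
        bucket_start[b] = sum;
        sum += count[b];
    }
    // e.g. count = {3, 1, 4, ...}  ->  bucket_start = {0, 3, 4, 8, ...}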

The second step then has to assign each element a new index. If I were to implement this on a CPU, the algorithm would be something like this:

for all elements e:
    new_index[e] = bucket_start[bucket(e)]++

Of course, this doesn't work on the GPU. Every item needs read-write access to the bucket_start array, which amounts to a synchronization between all work items, the worst thing we can do.

An idea is to put some of the computation into work groups. But I'm not sure how exactly this should be done, since I'm not experienced in GPGPU computing.

In global memory, we have the bucket start array initialized with the prefix sum as above. Access to this array is "mutexed" with an atomic int. (I'm new to this, so I may be mixing up some terms here.)

Every work group is implicitly assigned a part of the input element array. It uses a local bucket array containing the new indices, relative to the (global) bucket starts, which we do not know yet. Once one of these "local buffers" is full, the work group has to write it into the global array. For this, it locks access to the global bucket start array, increments those values by the current local bucket sizes, unlocks, and can then write the result into the global new_index array (by adding the corresponding offset). This process is repeated until all assigned elements are processed.

Two questions arise:

  1. Is this a good approach? I know that reading from and writing to global memory is most likely the bottleneck here, especially since I'm trying to acquire synchronized access to (at least only a small part of) the global memory. But maybe there is a much better approach, maybe using kernel decomposition. Note that I try to avoid reading data back from the GPU to the CPU between kernels (to avoid an OpenCL command queue flush, which is also bad, as I was taught).

  2. In the algorithm design above, how do I implement the locking mechanism? Will something like the following code work? In particular, I expect problems when the hardware executes work items "truly in parallel" in SIMD groups, like Nvidia "warps". In my current code, all items of a work group would try to acquire the lock in SIMD fashion. Should I restrict this to the first work item only, and use barriers to keep them locally in sync?

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
    
    __kernel void putInBuckets(__global uint *mutex,
                               __global uint *bucket_start,
                               __global uint *new_index)
    {
        __local uint bucket_size[NUM_BUCKETS];
        __local uint bucket[NUM_BUCKETS][LOCAL_MAX_BUCKET_SIZE]; // local "new_index"
        __local uint l_bucket_start[NUM_BUCKETS]; // __local variables belong at kernel scope, not inside the loop
    
        while (...)
        {
            // process a couple of elements locally until a local bucket is full
            ...
    
            // "lock"
            while(atomic_xchg(mutex, 1)) {
            }
    
            // "critical section"
            for (int b = 0; b < NUM_BUCKETS; ++b) {
                l_bucket_start[b] = bucket_start[b]; // where should we write?
                bucket_start[b] += bucket_size[b];   // update global offset
            }
    
            // "unlock"
            atomic_xchg(mutex, 0);
    
            // write to global memory by adding the offset
            for (...)
                new_index[...] = ... + l_bucket_start[b];
        }
    }
    


OTHER TIPS

First, never try to implement a locking algorithm on a GPU. It will deadlock and stall. This is because a GPU is a SIMD device and the threads don't execute independently as on a CPU. A GPU executes a set of threads, called a warp (Nvidia) or wavefront (AMD), synchronously. So if one thread in the wavefront is stalled, it stalls all other threads in the wavefront. If the unlocking thread is in a stalled wavefront, it will NOT execute and unlock the mutex.
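To make the hazard concrete, consider the spin lock from the question (an illustrative sketch, not working code):

    // All work items of one wavefront reach this loop together. At most one
    // of them acquires the mutex, but on SIMD hardware the branch diverges:
    // the "winner" can be masked off while its siblings keep spinning in the
    // same instruction stream, so the unlock is never reached.
    while (atomic_xchg(mutex, 1)) {
    }
    // ... critical section ...
    atomic_xchg(mutex, 0); // may never execute: the wavefront deadlocks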

Atomic operations are OK.

What you should consider is a lock-free approach. See this paper for an explanation and sample CUDA code: http://www.cse.iitk.ac.in/users/mainakc/pub/icpads2012.pdf/

It describes lock-free hash tables, linked lists, and skip lists, with some sample CUDA code.

A suggested approach is to create a two-level data structure.

The first level is a lock-free skip list. Each skip-list entry holds the second-level structure, a lock-free linked list for duplicate values, plus an atomic count of the number of entries.

The insert method:

1) Generate the bucket key (one of the 64 possible values).
2) Find the key in the skip list.
3) If not found, insert the key into the skip list.
4) Insert the data into the linked list (a rough sketch follows below).
5) Increment the atomic counter for this bucket.
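Since the base OpenCL atomics operate on 32-bit integers rather than pointers, step 4 can address nodes by index. A minimal sketch, with all names (NIL, head, next, push_node) hypothetical:

    #define NIL 0xFFFFFFFFu // marks an empty list

    // Lock-free prepend of node n to bucket b: head[b] is the index of the
    // first node of bucket b, next[n] is the index of node n's successor.
    void push_node(volatile __global uint *head, __global uint *next,
                   uint b, uint n)
    {
        uint old;
        do {
            old = head[b];
            next[n] = old; // link the new node in front of the current head
        } while (atomic_cmpxchg(&head[b], old, n) != old);
    }

Step 5 is then a plain atomic_inc on that bucket's counter.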

After insertion, prefix-sum all the counters of the skip-list buckets to find the offset of the output.

I found a much easier way to append the local buffers to the global arrays. It only requires two steps, one of which involves atomic operations.

The first step is to assign each thread the index in the global target array where it will write its elements. For this, we can use atomic_add on a __global counter to add the number of elements to be appended; in this concrete example, use it on bucket_start. The return value of atomic_add is the old value.

In the second step we use this return value as the base index for copying the local buffers into the target array. If we decide to use a whole work group for one such append operation, we distribute the copying of the local buffer to the global array within the work group "as usual". In the above bucket sort example, we copy multiple arrays, and when the number of arrays (= number of buckets) equals the work group size, we can instead assign each thread one bucket, which it copies in a loop. A sketch follows below.
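A minimal sketch in terms of the question's variables, assuming bucket[b][i] holds the original index of the i-th element this work group assigned to bucket b and bucket_size[b] holds the count (lid and l_bucket_start are hypothetical names):

    uint lid = get_local_id(0);
    __local uint l_bucket_start[NUM_BUCKETS];

    // Step 1: one work item per bucket reserves a contiguous range in the
    // global output. atomic_add returns the old value of bucket_start[lid],
    // which is exactly this group's base index for that bucket.
    if (lid < NUM_BUCKETS)
        l_bucket_start[lid] = atomic_add(&bucket_start[lid], bucket_size[lid]);
    barrier(CLK_LOCAL_MEM_FENCE);

    // Step 2: copy the local buffers into the reserved ranges, one work item
    // per bucket (assumes the work group has at least NUM_BUCKETS items).
    if (lid < NUM_BUCKETS)
        for (uint i = 0; i < bucket_size[lid]; ++i)
            new_index[bucket[lid][i]] = l_bucket_start[lid] + i;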

I recently had to solve a similar problem, and I found a much more elegant and efficient solution. I thought I would share.

The general algorithm is as follows:

1. kernel 1: thread per element

  • Count the number of elements in each bucket (histogram).
  • For each element: calculate its offset from the start of its bucket (the tricky part).

2. kernel 2: thread per bucket

  • prefix sum (scan) on the histogram to calculate the start of each bucket

3. kernel 3: thread per element

  • scatter elements.

    for each element i in input: output[i] = prefix_sum[input[i]] + offsets[i];
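For kernel 2, there are only NUM_BUCKETS = 64 counters, so a single work group suffices. A possible sketch (not the author's code): a Hillis-Steele scan in local memory, launched with one work group of NUM_BUCKETS work items:

    __kernel void bucket_prefix_sum(__global uint *histogram,
                                    __global uint *prefix_sum_histogram)
    {
        __local uint tmp[NUM_BUCKETS];
        size_t lid = get_local_id(0);

        // seed with the histogram shifted right by one -> exclusive scan
        tmp[lid] = (lid == 0) ? 0 : histogram[lid - 1];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (uint stride = 1; stride < NUM_BUCKETS; stride *= 2) {
            uint v = (lid >= stride) ? tmp[lid - stride] : 0;
            barrier(CLK_LOCAL_MEM_FENCE); // all reads before any writes
            tmp[lid] += v;
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        prefix_sum_histogram[lid] = tmp[lid];
    }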

The tricky part is to generate the offsets array that we use in the 3rd kernel.

In the first kernel, we define a local cache that contains the per-workgroup bucket histogram. I use the fact that atomic_add returns the previous value of the counter, the 'current' offset. This fact is the key.

    // On OpenCL 1.0 the 32-bit base atomics are extensions; from 1.1 on they are core.
    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
    #pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable

    __kernel void bucket_histogram(__global uint *input,
                                   __global uint *histogram,
                                   __global uint *offsets)
    {
        __local uint local_histogram[NUM_BUCKETS];

        size_t local_idx = get_local_id(0);
        size_t global_idx = get_global_id(0);

        // zero local mem
        if (local_idx < NUM_BUCKETS)
        {
            local_histogram[local_idx] = 0;
        }
        barrier(CLK_LOCAL_MEM_FENCE);

        // increment the local histogram, save the local offset for later
        uint value = input[global_idx];
        uint local_offset = atomic_add(&local_histogram[value], 1);

        barrier(CLK_LOCAL_MEM_FENCE);

        // store the buckets in the global histogram (for the later prefix sum)
        if (local_idx < NUM_BUCKETS)
        {
            uint count = local_histogram[local_idx];
            if (count > 0)
            {
                // increment the global histogram, save the offset!
                uint group_offset = atomic_add(&histogram[local_idx], count);
                local_histogram[local_idx] = group_offset;
            }
        }

        barrier(CLK_LOCAL_MEM_FENCE);

        // local_histogram now changes roles: it contains this group's offset
        // from the start of each bucket
        offsets[global_idx] = local_offset + local_histogram[value];
    }

The second kernel performs a prefix sum to calculate the start of each bucket. The 3rd kernel simply combines all the offsets:

    __kernel void bucket_sort_scatter(__global uint *input,
                                      __global uint *prefix_sum_histogram,
                                      __global uint *offsets,
                                      __global data_t *output)
    {
        size_t global_idx = get_global_id(0);
        uint value = input[global_idx];
        uint scatter_target = offsets[global_idx] + prefix_sum_histogram[value];
        output[scatter_target] = value;
    }