Question

I recently encountered a simple notion in programming but i stuck when i tried to implement it in cuda. Suppose that i have thousands of elements and i want to find the closest pair between them. I use atomicMIN in global memory (suppose that we dont want to reduce) so if the distance which is calculated by each thread is smaller than the distance stored in the global variable the atomicCAS will replace it with the smaller value. For example i have the global variable float gbl_min_dist

To do this I use the following code:

__device__ inline float atomicMin(float *addr, float value){
    float old = *addr, assumed;
    if( old <= value ) return old;
    do{
        assumed = old;
        old = atomicCAS((unsigned int*)addr, __float_as_int(assumed), __float_as_int(value));
    }while( old!=assumed );
    return old;
}

Suppose now that we want to store the index of the two points that were closer together and for which the atomicMIN has successfully replaced the old minimum distance with the one calculated by those two points. What I mean is that I only want to store the indeces of the two points that currently have the smaller distance if and only if its distance has just been successfully swaped in the global variable

typedef struct {float gbl_min_dist, 
                unsigned int point1,
                unsigned int point2;} global_closest_points;

So here, when a thread executes the atomicMIN, if the value that is proposed by that tread to be compared is swapped in the gbl_min_dist then i also need to swap the p1, p2 with the values from the thread. If the gbl_min_dist is not swapped then I dont want to store the points cause this would give wrong points but correct minimum distance.

Is there any return value to check if atomicCAS has made the swap?

Any ideas on how to implement this within the atomicMIN?

Thanks in advance

Was it helpful?

Solution

  1. You could use a critical section to have each thread have exclusive access to the data while it is updating it.
  2. Since your gbl_min_dist is a 32-bit value, if you can figure out a way to squeeze both p1 and p2 into a single 32-bit value, you could use an approach like the custom atomics answer I gave here.

If you simply use whether or not the atomicCAS made the first swap to condition additional code to update p1 and p2, I think it's still possible to have a race condition that allows your data to get out of sync between thread updates.

OTHER TIPS

You could construct a critical section to atomically update the min value and corresponding point indices. The following link gives a example on how to build the CS with atomicCAS() and atomicExch().

https://github.com/ArchaeaSoftware/cudahandbook/blob/master/memory/spinlockReduction.cu

On the other hand, I would suggest replace the atomic min operations by parallel reduction algorithm. That may improve the performance.

The way I suggest is, rather than depend on a stored distance, recompute it any time it is critical that the stored points may have changed:

typedef struct {
    unsigned int point1, 
    unsigned int point2;
}

global_closest_points, local_closest_points, temp_c_p;

local_dist = distance(local_closest_points.point1, local_closest_points.point2);
temp_c_p = global_closest_points;
while (local_dist < distance(temp_c_p.point1, temp_c_p.point2)
    temp_c_p = atomicCAS(&global_closest_points, temp_c_p, local_closest_points);

Old habits were, save rather than recompute. But with modern processors, that's often not optimal. On CUDA, an atomic update to global memory takes more time than computing hundreds of double-precision distances.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top