How to use atomicCAS for multiple variables with conditionals in CUDA

Question 1

You could use a critical section so that each thread has exclusive access to the data while it updates it.
Since your gbl_min_dist is a 32-bit value, if you can figure out a way to squeeze both p1 and p2 into a single 32-bit value, you could use an approach like the custom atomics answer I gave here: the distance and the packed indices then fit in one 64-bit word, which a single atomicCAS can update.
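
As a rough sketch of that packing idea (not the code from the linked answer), the following assumes the distance is a non-negative float and that both point indices fit in 16 bits. The layout, the helper names (pack, packed_dist, atomic_min_pair) and the toy kernel that compares adjacent 1D points are all made up for illustration.

```cuda
#include <cuda_runtime.h>

// Assumed layout: float distance bits in the high 32 bits,
// p1 and p2 (16 bits each) in the low 32 bits.
__device__ unsigned long long pack(float dist, unsigned int p1, unsigned int p2)
{
    unsigned long long hi = __float_as_uint(dist);
    return (hi << 32) | ((p1 & 0xFFFFu) << 16) | (p2 & 0xFFFFu);
}

__device__ float packed_dist(unsigned long long v)
{
    return __uint_as_float((unsigned int)(v >> 32));
}

// Install (dist, p1, p2) only if dist is smaller than the stored distance.
__device__ void atomic_min_pair(unsigned long long *addr, float dist,
                                unsigned int p1, unsigned int p2)
{
    unsigned long long mine = pack(dist, p1, p2);
    unsigned long long seen = *addr;
    while (dist < packed_dist(seen)) {
        unsigned long long prev = atomicCAS(addr, seen, mine);
        if (prev == seen) break;   // our triple was installed as one unit
        seen = prev;               // another thread won; re-check against its value
    }
}

// Toy kernel: candidate pairs are adjacent points on a line.
__global__ void closest_adjacent(const float *x, int n, unsigned long long *best)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 1 < n)
        atomic_min_pair(best, fabsf(x[i + 1] - x[i]), i, i + 1);
}

// Host helper (hypothetical): returns the packed winner for a small input.
unsigned long long run_closest_adjacent(const float *h_x, int n)
{
    float *d_x;
    unsigned long long *d_best;
    unsigned long long result = 0x7F800000ull << 32;   // +infinity as the starting distance
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_best, sizeof result);
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_best, &result, sizeof result, cudaMemcpyHostToDevice);
    closest_adjacent<<<(n + 255) / 256, 256>>>(d_x, n, d_best);
    cudaMemcpy(&result, d_best, sizeof result, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_best);
    return result;
}
```

Because the distance and both indices travel in one word, no thread can ever observe a distance that belongs to a different pair. Ties are resolved by whichever thread gets there first.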

If you simply use the result of the atomicCAS on gbl_min_dist (whether or not it made the swap) to decide whether to run additional code that updates p1 and p2, I think a race condition is still possible: another thread can update the distance between your swap and your writes to p1 and p2, so the data gets out of sync between thread updates.

Question 2

You could construct a critical section to atomically update the min value and the corresponding point indices. The following link gives an example of how to build the critical section with atomicCAS() and atomicExch().

https://github.com/ArchaeaSoftware/cudahandbook/blob/master/memory/spinlockReduction.cu
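
A minimal sketch of such a lock, written from scratch rather than copied from the linked file, might look like the following. The Closest record, the function names and the toy kernel over adjacent 1D points are assumptions for illustration. The kernel is launched with one thread per block as a precaution: on GPUs without independent thread scheduling, threads of the same warp that spin on one lock can deadlock.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

struct Closest {              // hypothetical record guarded by the lock
    float dist;
    unsigned int p1, p2;
};

__device__ void update_closest(int *lock, volatile Closest *best,
                               float dist, unsigned int p1, unsigned int p2)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {   // acquire: 0 = free, 1 = held
            __threadfence();
            if (dist < best->dist) {        // all three fields change together
                best->dist = dist;
                best->p1 = p1;
                best->p2 = p2;
            }
            __threadfence();                // make the writes visible before releasing
            atomicExch(lock, 0);            // release
            done = true;
        }
    }
}

// Toy kernel: block i proposes the adjacent pair (i, i + 1).
__global__ void closest_adjacent_locked(const float *x, int n, int *lock, Closest *best)
{
    int i = blockIdx.x;
    if (threadIdx.x == 0 && i + 1 < n)
        update_closest(lock, best, fabsf(x[i + 1] - x[i]), i, i + 1);
}

// Host helper (hypothetical): assumes n >= 2.
Closest run_locked_closest(const float *h_x, int n)
{
    float *d_x;
    int *d_lock;
    Closest *d_best;
    Closest result = {FLT_MAX, 0u, 0u};
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_lock, sizeof(int));
    cudaMalloc(&d_best, sizeof(Closest));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_lock, 0, sizeof(int));
    cudaMemcpy(d_best, &result, sizeof(Closest), cudaMemcpyHostToDevice);
    closest_adjacent_locked<<<n - 1, 1>>>(d_x, n, d_lock, d_best);
    cudaMemcpy(&result, d_best, sizeof(Closest), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_lock);
    cudaFree(d_best);
    return result;
}
```

Every update is serialized through the one lock, so this is correct but slow when many threads contend for it.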

On the other hand, I would suggest replacing the atomic min operations with a parallel reduction algorithm. That may improve performance.
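
One way such a reduction could look, as a sketch only: assume the candidate distances have already been written to an array, with candidate k standing for some pair (p1, p2) that can be recovered from k. Each block reduces its slice in shared memory and the host picks the winner among the per-block results. The names block_min and argmin_distance and the block size are illustrative choices.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

#define THREADS 256   // block size; must be a power of two for the halving loop

// Each block reduces its candidates to one (distance, index) winner.
__global__ void block_min(const float *dist, int n, float *blk_dist, int *blk_idx)
{
    __shared__ float s_dist[THREADS];
    __shared__ int   s_idx[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s_dist[tid] = (i < n) ? dist[i] : FLT_MAX;   // pad the tail with a neutral value
    s_idx[tid] = i;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride && s_dist[tid + stride] < s_dist[tid]) {
            s_dist[tid] = s_dist[tid + stride];  // distance and index move together
            s_idx[tid] = s_idx[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) {
        blk_dist[blockIdx.x] = s_dist[0];
        blk_idx[blockIdx.x] = s_idx[0];
    }
}

// Host helper (hypothetical): index of the smallest candidate distance.
int argmin_distance(const float *h_dist, int n)
{
    int blocks = (n + THREADS - 1) / THREADS;
    float *d_dist, *d_bd;
    int *d_bi;
    cudaMalloc(&d_dist, n * sizeof(float));
    cudaMalloc(&d_bd, blocks * sizeof(float));
    cudaMalloc(&d_bi, blocks * sizeof(int));
    cudaMemcpy(d_dist, h_dist, n * sizeof(float), cudaMemcpyHostToDevice);
    block_min<<<blocks, THREADS>>>(d_dist, n, d_bd, d_bi);
    std::vector<float> bd(blocks);
    std::vector<int> bi(blocks);
    cudaMemcpy(bd.data(), d_bd, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(bi.data(), d_bi, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    int best = 0;                                // finish the reduction on the host
    for (int b = 1; b < blocks; ++b)
        if (bd[b] < bd[best]) best = b;
    cudaFree(d_dist);
    cudaFree(d_bd);
    cudaFree(d_bi);
    return bi[best];
}
```

No atomics are involved: threads in a block cooperate through shared memory and __syncthreads(), and each block writes to its own output slot.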

Question 3

What I suggest is, rather than depending on a stored distance, to recompute the distance whenever it matters that the stored points may have changed:

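// atomicCAS has no overload for structs, so both indices are packed into one
// 64-bit word: point1 in the high 32 bits, point2 in the low 32 bits.
typedef unsigned long long closest_points;
#define POINT1(cp) ((unsigned int)((cp) >> 32))
#define POINT2(cp) ((unsigned int)((cp) & 0xFFFFFFFFu))

__device__ closest_points global_closest_points;   // must start out holding a valid pair
closest_points local_closest_points, temp_c_p;     // per-thread, inside the kernel

// distance() is your own function taking two point indices.
local_dist = distance(POINT1(local_closest_points), POINT2(local_closest_points));
temp_c_p = global_closest_points;
// Retry while our pair is closer than the last pair seen in global memory.
// atomicCAS returns the value it found, so temp_c_p always holds the latest global pair.
while (local_dist < distance(POINT1(temp_c_p), POINT2(temp_c_p)))
    temp_c_p = atomicCAS(&global_closest_points, temp_c_p, local_closest_points);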

The old habit was to save rather than recompute. With modern processors, that is often not optimal. On CUDA, an atomic update to global memory can take more time than computing hundreds of double-precision distances.