The issue is that threads that run concurrently will see the same leastDistance
and overwrite each other's results. There are two values that are shared between threads; leastDistance
and winner
. You have two basic options. You can write out the results from all the threads and then do a second pass over the data with a parallel reduction to determine which vector had the best match or you can implement this with a custom atomic operation using atomicCAS()
.
The first method is the easiest. My guess is that it will also give you the best performance, though it does add a dependency for the the free Thrust library. You would use thrust::min_element().
The method using atomicCAS()
uses the fact that atomicCAS()
has a 64-bit mode, in which you can assign any semantics that you wish to a 64-bit value. In your case, you would use 32 bits to store leastDistance
and 32 bits to store winner
. To use this method, adapt this example in the CUDA C Programming Guide that implements a double precision floating point atomicAdd()
.
__device__ double atomicAdd(double* address, double val)
{
unsigned long long int* address_as_ull =
(unsigned long long int*)address;
unsigned long long int old = *address_as_ull, assumed;
do {
assumed = old;
old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
} while (assumed != old);
return __longlong_as_double(old);
}