Question

I have a large array of floating point numbers and I want to find out the minimum value of the array (ignoring -1s wherever present) as well as its index, using reduction in CUDA. I have written the following code to do this, which in my opinion should work:

 __global__ void get_min_cost(float *d_Cost,int n,int *last_block_number,int *number_in_last_block,int *d_index){
     int tid = threadIdx.x;
     int myid = blockDim.x * blockIdx.x + threadIdx.x;
     int s;

     if(result == (*last_block_number)-1){
         s = (*number_in_last_block)/2;
     }else{
         s = 1024/2;
     }

     for(;s>0;s/=2){
         if(myid+s>=n)
             continue;
         if(tid<s){
             if(d_Cost[myid+s] == -1){
                 continue;
             }else if(d_Cost[myid] == -1 && d_Cost[myid+s] != -1){
                 d_Cost[myid] = d_Cost[myid+s];
                 d_index[myid] = d_index[myid+s];
             }else{
                 // both not -1
                 if(d_Cost[myid]<=d_Cost[myid+s])
                     continue;
                 else{
                     d_Cost[myid] = d_Cost[myid+s];
                     d_index[myid] = d_index[myid+s];
                 }
             }
         }
         else
             continue;
         __syncthreads();
     }
     if(tid==0){
         d_Cost[blockIdx.x] = d_Cost[myid];
         d_index[blockIdx.x] = d_index[myid];
     }
     return;
 }

The last_block_number argument is the id of the last block, and number_in_last_block is the number of elements in last block (which is a power of 2). Thus, all blocks will launch 1024 threads every time and the last block will only use number_in_last_block threads, while others will use 1024 threads.

After this function runs, I expect the minimum values for each block to be in d_Cost[blockIdx.x] and their indices in d_index[blockIdx.x].

I call this function multiple times, each time updating the number of threads and blocks. The second time I call this function, the number of threads now become equal to the number of blocks remaining etc.

However, the above function isn't giving me the desired output. In fact, it gives a different output every time I run the program, i.e, it returns an incorrect value as the minimum during some intermediate iteration (though that incorrect value is quite close to the minimum every time).

What am I doing wrong here?

Was it helpful?

Solution

As I mentioned in my comment above, I would recommend to avoid writing reductions of your own and use CUDA Thrust whenever possible. This holds true even in the case when you need to customize those operations, the customization being possible by properly overloading, e.g., relational operations.

Below I'm providing a simple code to evaluate the minimum in an array along with its index. It is based on a classical example contained in the An Introduction to Thrust presentation. The only addition is skipping, as you requested, the -1's from the counting. This can be reasonably done by replacing all the -1's in the array by INT_MAX, i.e., the maximum representable integer according to IEEE floating point standards.

#include <thrust\device_vector.h>
#include <thrust\replace.h>
#include <thrust\sequence.h>
#include <thrust\reduce.h>
#include <thrust\iterator\zip_iterator.h>
#include <thrust\tuple.h>

// --- Struct returning the smallest of two tuples
struct smaller_tuple
{
    __host__ __device__ thrust::tuple<int,int> operator()(thrust::tuple<int,int> a, thrust::tuple<int,int> b)
    {
        if (a < b)
            return a;
        else
            return b;
    }
};


void main() {

    const int N = 20;
    const int large_value = INT_MAX;

    // --- Setting the data vector
    thrust::device_vector<int> d_vec(N,10);
    d_vec[3] = -1; d_vec[5] = -2;

    // --- Copying the data vector to a new vector where the -1's are changed to FLT_MAX
    thrust::device_vector<int> d_vec_temp(d_vec);
    thrust::replace(d_vec_temp.begin(), d_vec_temp.end(), -1, large_value);

    // --- Creating the index sequence [0, 1, 2, ... )
    thrust::device_vector<int> indices(d_vec_temp.size());
    thrust::sequence(indices.begin(), indices.end());

    // --- Setting the initial value of the search
    thrust::tuple<int,int> init(d_vec_temp[0],0);

    thrust::tuple<int,int> smallest;
    smallest = thrust::reduce(thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.begin(), indices.begin())),
                          thrust::make_zip_iterator(thrust::make_tuple(d_vec_temp.end(), indices.end())),
                          init, smaller_tuple());

    printf("Smallest %i %i\n",thrust::get<0>(smallest),thrust::get<1>(smallest));
    getchar();
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top