Replacing for-loops using thrust::transform

https://stackoverflow.com/questions/15823015

01-04-2022

Question

I am trying to optimize my code by implementing for loops on threads of the GPU. I am trying to eliminate two for loops using thrust::transform. The code in C++ looks like:

    ka_index = 0;
    for (int i = 0; i < N_gene; i++)
    {
        for (int j = 0; j < n_ka_d[i]; j++ )
        {
            co0 = get_coeff0(ka_vec_d[ka_index]);
            act[i] += (co0*ka_val_d[ka_index]); 
            ka_index++;
        }
        act[i] = pow(act[i],n); 
    }

I am estimating coefficients for an ordinary differential equation (ODE) in the loops above and have transferred all the data onto the device using thrust. The number of genes is N_gene, so the first for loop runs N_gene times. The second for loop is bounded by the number of activators of each gene (friendly genes in the gene pool whose presence increases the concentration of gene i), which is stored in the n_ka vector. The value of n_ka[i] can vary from 0 to N_gene - 1. ka_val holds the measure of activation for each activator, and ka_vec_d holds the index of the gene that activates gene i.
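
The flattened layout described above can be made concrete with a minimal host-side reference in plain C++ (no Thrust). It assumes that n_ka holds the per-gene activator counts and that ka_vec and ka_val store the activators of gene 0, then gene 1, and so on, back to back. The function name compute_act_reference is hypothetical, and the coeff0 parameter stands in for get_coeff0, whose body is not shown in the question.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Host-side reference for the nested loops (plain C++, not Thrust).
// coeff0 is a placeholder for get_coeff0, whose body the question does not show.
std::vector<double> compute_act_reference(
    const std::vector<int>& n_ka,       // number of activators per gene
    const std::vector<int>& ka_vec,     // activator gene indices, flattened
    const std::vector<double>& ka_val,  // activation strengths, flattened
    double n,                           // exponent applied to each sum
    const std::function<double(int)>& coeff0)
{
    assert(ka_vec.size() == ka_val.size());
    std::vector<double> act(n_ka.size(), 0.0);
    std::size_t ka_index = 0;  // running position in the flattened arrays
    for (std::size_t i = 0; i < n_ka.size(); ++i) {
        for (int j = 0; j < n_ka[i]; ++j) {
            assert(ka_index < ka_vec.size());
            act[i] += coeff0(ka_vec[ka_index]) * ka_val[ka_index];
            ++ka_index;
        }
        act[i] = std::pow(act[i], n);
    }
    return act;
}
```

A reference like this is mainly useful for checking a GPU version against known inputs. For example, with n_ka = {2, 0, 1} the second gene has no activators, so its inner loop never runs and its sum stays at zero before the power is applied.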

I am trying to represent these loops using iterators but have been unable to do so. I am familiar with using thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple)) for a single for loop, but I am having a tough time coming up with a way to implement two for loops using counting_iterator or transform iterators. Any pointers or help converting these two for loops would be appreciated. Thanks for your time!

Solution

This looks like a reduce problem. I think you can use thrust::transform with zip iterators and thrust::reduce_by_key. A sketch of this solution is:

// generate indices
std::vector< int > hindices;
for( size_t i=0 ; i<N_gene ; ++i )
    for( size_t j=0 ; j<n_ka_d[i] ; ++j )
     hindices.push_back( i );
thrust::device_vector< int > indices = hindices;

// generate tmp
// trafo1 implements get_coeff0( get< 0 >( t ) ) * get< 1 >( t );
// one entry per activator, so tmp has the same length as ka_vec_d
thrust::device_vector< double > tmp( ka_vec_d.size() );
thrust::transform(
    thrust::make_zip_iterator(
        thrust::make_tuple( ka_vec_d.begin() , ka_val_d.begin() ) ) ,
    thrust::make_zip_iterator(
        thrust::make_tuple( ka_vec_d.end() , ka_val_d.end() ) ) ,
    tmp.begin() , trafo1 );

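// do the reduction for each act[i]
// reduce_by_key takes the keys output before the values output, and it writes
// one sum per run of equal keys, so a gene with n_ka[i] == 0 gets no entry
thrust::device_vector< int > indices_out( N_gene );
thrust::reduce_by_key( indices.begin() , indices.end() , tmp.begin() ,
    indices_out.begin() , act.begin() );

// do the pow transformation
thrust::transform( act.begin() , act.end() , act.begin() , pow_trafo );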
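
One detail of the sketch above is worth spelling out: reduce_by_key emits one output per run of equal consecutive keys, so when some n_ka[i] is 0 the sums come out compacted, and position k of the output is not necessarily gene k. The following is a host-side model of that behaviour in plain C++ (not Thrust code), together with a scatter step that uses the output keys to put each sum back at its gene index. The names reduce_by_key_host and scatter_sums are made up for this illustration. On the device, the scatter step could presumably be done with thrust::scatter, although that is not part of the original answer.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Host-side model of what reduce_by_key computes (plain C++, not Thrust):
// each run of consecutive equal keys is collapsed into one (key, sum) pair.
std::pair<std::vector<int>, std::vector<double>> reduce_by_key_host(
    const std::vector<int>& keys, const std::vector<double>& values)
{
    assert(keys.size() == values.size());
    std::vector<int> keys_out;
    std::vector<double> sums;
    for (std::size_t k = 0; k < keys.size(); ++k) {
        if (keys_out.empty() || keys_out.back() != keys[k]) {
            keys_out.push_back(keys[k]);  // a new run starts here
            sums.push_back(0.0);
        }
        sums.back() += values[k];
    }
    return {keys_out, sums};
}

// Place each sum at the gene index stored in keys_out. Genes that had no
// activators never appear as a key and keep the initial value 0.
std::vector<double> scatter_sums(std::size_t n_gene,
    const std::vector<int>& keys_out, const std::vector<double>& sums)
{
    assert(keys_out.size() == sums.size());
    std::vector<double> act(n_gene, 0.0);
    for (std::size_t k = 0; k < keys_out.size(); ++k)
        act[static_cast<std::size_t>(keys_out[k])] = sums[k];
    return act;
}
```

If every gene is known to have at least one activator, the output is not compacted and the scatter step is unnecessary. The pow transformation would be applied after the scatter in either case.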

I think this can also be optimized with transform_iterators to reduce the number of calls to thrust::transform and thrust::reduce_by_key.

Licensed under: CC-BY-SA with attribution
