Code:

double x(){return (double)rand()/(double)RAND_MAX;}
double y(){return (double)rand()/(double)RAND_MAX;}
double z(){return (double)rand()/(double)RAND_MAX;}

int d(double x, double y, double z){
        if ( ( (pow(x,2)+pow(y,2)) <1 ) && ( z<=1 && z>=0 )) return 1;
        return 0;
    }

double f(double x, double y, double z){
        return 1;
    }




#pragma omp parallel default(none) private(id,numt,j,local_sum,local_good_dots,local_coi,x_,y_,z_) shared(total_sum,good_dots,count_of_iterations)
    {
        local_coi = count_of_iterations;
        id = omp_get_thread_num() + 1;
        numt = omp_get_num_threads();
        #pragma omp for
        for (j = 1; j <= local_coi;  j++){
            x_=x();
            y_=y();
            z_=z();
            if (d(x_,y_,z_) == 1){
                local_sum += f(x_,y_,z_);
                local_good_dots += 1;

            }
        }

        #pragma omp critical
        {
            total_sum = total_sum + local_sum;
            good_dots = good_dots + local_good_dots;
        }
    }

Comment: This code is an implementation of the Monte Carlo method for computing the three-dimensional integral of the function f() over the region d().
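
(For reference: the final estimate is not shown in the snippet, but it would presumably be formed after the parallel region roughly like this; with this f() and d() over the unit cube it should converge to pi/4 ≈ 0.785.)

double volume = 1.0;  /* the sampling region [0,1]^3 has unit volume */
double integral = volume * total_sum / (double)count_of_iterations;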

I expected this code to run faster in multi-threaded mode (OpenMP).

But something went wrong.

After several hours of modifications (a reduction clause in the OpenMP pragma, simplifications of the if-condition, like f(x_,y_,z_) * d(x_,y_,z_)), I still did not understand why this simple loop becomes slower with a larger number of threads.

But after I pre-generated an array of random values for each coordinate before the loop and made them shared, my program became faster.

So, the question is:

How should I modify this code, and which functions (operations) are allowed inside parallel blocks?

P.S.: As far as I can see, the rand() function is not allowed (or am I wrong?).

Thanks for the help!

Modification (with @HristoIliev's help)

double x(){return (double)rand()/(double)RAND_MAX;}
double y(){return (double)rand()/(double)RAND_MAX;}
double z(){return (double)rand()/(double)RAND_MAX;}

int d(double x, double y, double z){
        if ( ( (pow(x,2)+pow(y,2)) <1 ) && ( z<=1 && z>=0 )) return 1;
        return 0;
    }

double f(double x, double y, double z){
        return 1;
    }


#pragma omp parallel default(none) private(j,local_coi,x_,y_,z_) shared(count_of_iterations) reduction(+:total_sum,good_dots)
    {
        local_coi = count_of_iterations;
        #pragma omp for
        for (j = 1; j <= local_coi;  j++){
            #pragma omp critical(prng)
            {
                x_=x();
                y_=y();
                z_=z();
            }
            if (d(x_,y_,z_) == 1){
                total_sum += f(x_,y_,z_);
                good_dots += 1;

            }
        }
    }

Solution

The random number generator rand() uses a global, statically allocated state that is shared by all threads and is thus not thread safe. Using it from multiple threads, you run into a very bad case of unprotected access to shared variables, which trashes the cache and slows the program down. You should be using rand_r() or erand48() instead - they use separate state storage that you have to provide. You have to declare one state per thread (e.g. make it private), basically creating a different PRNG for each thread. Then you have to seed them accordingly, otherwise you would get statistically bad results. In principle you can use the output of one of the *rand48() generators (e.g. nrand48()) to seed the others - it should be enough to get moderately long uncorrelated sequences.
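
For example, a rough sketch of that seeding scheme, using nrand48() to draw the per-thread seeds (the names master, per_thread and num_threads are made up for illustration):

unsigned short master[3] = {0x330E, 0x1234, 0xABCD};   /* master state, seeded once */
unsigned short per_thread[64][3];                       /* one 48-bit state per thread (64 is arbitrary) */
for (int t = 0; t < num_threads; t++)
    for (int k = 0; k < 3; k++)
        per_thread[t][k] = (unsigned short)nrand48(master);  /* draw seed words from the master stream */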

Here is a sample implementation using rand_r() (note that this is a very bad generator for Monte Carlo simulations; erand48() is better, and the best would be to use a "Mersenne Twister" type generator from the GNU Scientific Library, if available):

unsigned int prng_state;
#pragma omp threadprivate(prng_state)

double x(){return (double)rand_r(&prng_state)/(double)RAND_MAX;}
double y(){return (double)rand_r(&prng_state)/(double)RAND_MAX;}
double z(){return (double)rand_r(&prng_state)/(double)RAND_MAX;}

int d(double x, double y, double z){
    if ( ( (pow(x,2)+pow(y,2)) <1 ) && ( z<=1 && z>=0 )) return 1;
    return 0;
}

double f(double x, double y, double z){
    return 1;
}

...

#pragma omp parallel default(none) \
            private(id,numt,x_,y_,z_) \
            shared(count_of_iterations) \
            reduction(+:total_sum,good_dots)
{
    id = omp_get_thread_num() + 1;
    numt = omp_get_num_threads();

    // Sample PRNG seeding code - DO NOT USE IN PRODUCTION CODE!
    prng_state = 67894 + 1337*id;

    #pragma omp for
    for (j = 1; j <= count_of_iterations;  j++){
        x_=x();
        y_=y();
        z_=z();
        if (d(x_,y_,z_) == 1){
            total_sum += f(x_,y_,z_);
            good_dots += 1;
        }
    }
}

This is just a very bad implementation (from a quality point of view), but it should give you an idea of how things work. It also shows how you can achieve thread safety with minimal changes to your original code. The basic points are:

  • the PRNG state prng_state is made private to each thread by the OpenMP threadprivate directive;
  • rand_r() with the thread-specific state variable is used instead of rand() in x(), y() and z();
  • the PRNG state is initialised in a thread-dependent manner, e.g. prng_state = 67894 + 1337*id;, so that different threads would (hopefully) get uncorrelated streams of pseudo-random numbers.

Note that rand() and rand_r() are of such bad quality that this is just an academic example. With longer PRNG sequences you would get correlated streams in the different threads, which would spoil the statistics. I leave it up to you to rewrite the code using erand48(); a rough sketch follows below.
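
For illustration, a minimal erand48() sketch under the same assumptions (it reuses j, x_, y_, z_, total_sum, good_dots and count_of_iterations from the code above; the seeding is again only a placeholder):

unsigned short prng_state48[3];          /* 48-bit PRNG state, one copy per thread */
#pragma omp threadprivate(prng_state48)

double x(){return erand48(prng_state48);}   /* erand48() already returns a double in [0,1) */
double y(){return erand48(prng_state48);}
double z(){return erand48(prng_state48);}

...

#pragma omp parallel default(none) \
            private(j,x_,y_,z_) \
            shared(count_of_iterations) \
            reduction(+:total_sum,good_dots)
{
    int id = omp_get_thread_num() + 1;

    // Placeholder seeding - derive three 16-bit words from the thread number
    prng_state48[0] = (unsigned short)0x330E;
    prng_state48[1] = (unsigned short)(12345 * id);
    prng_state48[2] = (unsigned short)(6789 + 1337 * id);

    #pragma omp for
    for (j = 1; j <= count_of_iterations; j++){
        x_ = x();
        y_ = y();
        z_ = z();
        if (d(x_,y_,z_) == 1){
            total_sum += f(x_,y_,z_);
            good_dots += 1;
        }
    }
}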

To answer your initial question: all thread-safe function calls are allowed inside a parallel block. You can also call non-thread-safe functions, but you have to protect the calls inside (named) critical constructs, e.g.:

#pragma omp for
for (j = 1; j <= local_coi; j++) {
    #pragma omp critical(prng)
    {
        x_=x();
        y_=y();
        z_=z();
    }
    if (d(x_,y_,z_) == 1) {
        local_sum += f(x_,y_,z_);
        local_good_dots += 1;
    }
}

This would ensure that no calls to rand() are made in parallel. But you would still have read-modify-write access to the shared PRNG state, hence the cache-related slowdown.

Also, do not try to reimplement OpenMP reductions or similar constructs. Compiler vendors already put tremendous effort into making sure they are implemented in the best (read: fastest) way possible.
