Correct OpenMP pragmas for pi monte carlo in C with not thread-safe random number generator

Question 1

Try the code below. It makes a private state for each thread. I did something similar with the at rand_r function Why does calculation with OpenMP take 100x more time than with a single thread?

Edit: I updated my code using some of Hristo's suggestions. I used threadprivate (for the first time). I also used a better rand function which gives a better estimate of pi but it's still not good enough.

One strange things was I had to define the function rng_int after threadprivate otherwise I got an error "error: 'state' declared 'threadprivate' after first use". I should probably ask a question about this.

//gcc -O3 -Wall -pedantic -fopenmp main.c
#include <omp.h>
#include <stdio.h>

#define RNG_MOD 0x80000000
int state;

int rng_int(void);
double rng_doub(double range);

int main() {
    int i, numIn, n;
    double x, y, pi;

    n = 1<<30;
    numIn = 0;

    #pragma omp threadprivate(state)
    #pragma omp parallel private(x, y) reduction(+:numIn) 
    {

        state = 25234 + 17 * omp_get_thread_num();
        #pragma omp for
        for (i = 0; i <= n; i++) {
            x = (double)rng_doub(1.0);
            y = (double)rng_doub(1.0);
            if (x*x + y*y <= 1) numIn++;
        }
    }
    pi = 4.*numIn / n;
    printf("asdf pi %f\n", pi);
    return 0;
}

int rng_int(void) {
   // & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
   return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}

double rng_doub(double range) {
    return ((double)rng_int()) / (((double)RNG_MOD)/range);
}

You can see the results (and edit and run the code) at http://coliru.stacked-crooked.com/a/23c1753a1b7d1b0d

Question 2

Your original linear congruent PRNG has a cycle length of 49400, therefore you are only getting 29700 unique test points. This is a terrible generator to be used for any kind of Monte Carlo simulations. Even if you make 100000000 trials, you won't get any closer to the true value of Pi because you are simply repeating the same points over and over again and as a result both the final value of z and iters are simply multiplied by the same constant, which cancel in the end during the division.

The per-thread seed introduced by Z boson improves the situation a little bit with the number of unique points increasing with the total number of OpenMP threads. The increase is not linear since if the seed of one PRNG falls in the sequence of another PRNG, both PRNGs produce the same sequence shifted with no more than 49400 elements. Given the cycle length, each PRNG covers 49400/RNG_MOD = 6,7% of the total output range and that is the probability of two PRNGs being synchronised. There are a total of RNG_MOD/49400 = 15 unique sequences possible. It basically means that in the best seeding case scenario you won't be able to get past 30 threads as any other thread would simply repeat the result of some of the others. The multiplier 2 comes from the fact that each point uses two elements from the sequence and therefore it is possible to get a different set of points if you shift the sequence by one element.

The ultimate solution is to completely drop your PRNG and stick to something like Mersenne twister MT19937, which has a cycle length of 2¹⁹⁹³⁷ − 1 and a very strong seeding algorithm. If you are not able to use another PRNG as you state in your question, at least modify the constants of the LCG to match those used in rand():

int rng_int(void) {
   static int state = 1;

   // & 0x7fffffff is equivalent to modulo with RNG_MOD = 2^31
   return (state = (state * 1103515245 + 12345) & 0x7fffffff);
}

Note that rand() is not a good PRNG - it is still bad. It is just a little better than the one used in your code.