Question

I am trying to speed up my MPI project by using OpenMP. I have a dataset of 1000 2D points, and am using a brute-force algorithm to find the minimum distance in the 2D graph. When I try to split the thread of execution, however, it greatly hurts performance. How can I properly utilize OpenMP?

Here is my attempt:

double calcDistance(double input[][2], int start, int stop){

    double temp;
    //declare and initialize minimum
    double minimum =  pow (((input[start+1][0]) - (input[start][0])),2) + pow(((input[start+1][1]) - (input[start][1])),2);
    minimum = sqrt(minimum);

    closestIndex1 = start;
    closestIndex2 = start+1;
    //Brute Force Algorithm to find minimum distance in dataset.

    #pragma omp parallel for
    for(int i=start; i<stop; i++){
        for(int j=start; j<stop; j++){
            temp = pow(((input[j][0]) - (input[i][0])),2) + pow(((input[j][1]) - (input[i][1])),2);
            temp = sqrt(temp);
            if(temp < minimum && i < j){
                minimum = temp;
                closestIndex1 = i;
                closestIndex2 = j;
            }//endif
        }//end j
    }//end i
    return minimum;
}

I have to say WOW. Thank you, that was incredibly helpful, and really cleared up a bunch of questions that I had. Again, thank you, gha.st.

Solution

Analysis

First, it is pure luck that your program seems to work at all. You do indeed have a data race, which causes invalid results on my machine. Consider the following test harness for this post:

::std::cout << ::xtd::target_info() << "\n\n"; // [target os] [target architecture] with [compiler]

static const int count = 30000;
auto gen = ::std::bind(::std::normal_distribution<double>(0, 1000), ::std::mt19937_64(42));
std::unique_ptr<double[][2]> input(new double[count][2]);
for(size_t i = 0; i < count; ++i)
{
    input[i][0] = gen();
    input[i][1] = gen();
}

::xtd::stopwatch sw; // does what its name suggests
sw.start();
double minimum = calcDistance(input.get(), 0, count);
sw.stop();
::std::cout << minimum << "\n";
::std::cout << sw << "\n";

When executing your function with the omp pragma removed, its output is:

Windows x64 with icc 14.0

0.0559233
7045 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
7272 ms

When executing it with the omp pragma intact, its output is:

Windows x64 with icc 14.0

0.324085
675.9 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
4338 ms

Since the machine uses 24 threads (12 cores with HT enabled), the speedup is evident, but it could possibly be better, at least for MSVC. The compiler that generates the faster program (ICC) also exposes the data race by producing wrong results that differ from run to run.

Note: I was also able to see an incorrect result from MSVC when compiling a debug version for x86 with 10k iterations.

Proper usage of iteration-local variables

The temp variable in your code has a lifetime of one iteration of the innermost loop. By moving its scope to match its lifetime, we can eliminate one source of data races. I also took the liberty of removing two unused variables and changing the initialization of minimum to a constant:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    //#pragma omp parallel for // still broken
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow(((input[j][0]) - (input[i][0])), 2) + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j) minimum = temp;
        }
    }
    return minimum;
}

A proper OMP minimum computation

OMP has support for reductions, which will in all probability perform fairly well. To try it, we will use the following pragma, which gives each thread its own copy of the minimum variable; the per-thread copies are then combined using the minimum operator:

#pragma omp parallel for reduction(min: minimum)
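For clarity, here is a sketch of the full function with the reduction clause in place (min reductions require an OpenMP 3.1 compiler; with OpenMP disabled, the pragma is simply ignored and the loops run serially, producing the same result):

```cpp
#include <cmath>
#include <limits>

// Each thread works on a private copy of `minimum`; the copies are
// combined with the `min` operator at the end of the parallel region.
double calcDistance(double input[][2], int start, int stop) {
    double minimum = std::numeric_limits<double>::infinity();
    #pragma omp parallel for reduction(min: minimum)
    for (int i = start; i < stop; i++) {
        for (int j = start; j < stop; j++) {
            double temp = std::pow(input[j][0] - input[i][0], 2)
                        + std::pow(input[j][1] - input[i][1], 2);
            temp = std::sqrt(temp);
            if (temp < minimum && i < j) minimum = temp;
        }
    }
    return minimum;
}
```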

The results validate the approach for ICC:

Windows x64 with icc 14.0

0.0559233
622.1 ms

But MSVC howls error C3036: 'min' : invalid operator token in OpenMP 'reduction' clause, because it does not support minimum reductions (they were added in OpenMP 3.1, while MSVC implements OpenMP 2.0). To roll our own reduction, we will use a technique called double-checked locking:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow(((input[j][0]) - (input[i][0])), 2) + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j)
            {
                #pragma omp critical
                if(temp < minimum && i < j) minimum = temp;
            }
        }
    }
    return minimum;
}

This is not only correct, but also leads to comparable performance for MSVC (note that this is significantly faster than the incorrect version!):

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
653.1 ms

The performance of ICC does not suffer significantly:

Windows x64 with icc 14.0

0.0559233
636.8 ms

Serial optimizations

While the above is a proper parallelization of your serial code, it can be significantly optimized by considering that you are computing a whole bunch of temp results that you are never going to use, due to your i < j condition.

By simply starting the inner loop at i + 1, this wasted computation is avoided entirely, and the i < j check disappears from the loop body.

Another trick we use is delaying the sqrt computation until the last possible moment: since sqrt is strictly monotonic, comparing squared distances yields the same ordering as comparing the distances themselves, so we can minimize the square of the distance and take a single square root at the end.
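To make the equivalence concrete, here is a tiny sketch (the helper name is ours, not from the original code):

```cpp
#include <cmath>

// sqrt is strictly increasing on [0, inf), so the pair with the
// smallest squared distance is also the pair with the smallest
// distance. We can therefore compare squared distances in the loop
// and apply sqrt exactly once, to the final minimum.
double sqrtOfMin(double squaredA, double squaredB) {
    return std::sqrt(squaredA < squaredB ? squaredA : squaredB);
}
```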

Finally, calling pow just to square a number is fairly inefficient, since it incurs overhead that we do not need; a plain multiplication does the job.

This leads to the final code:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++) {
        for(int j = i + 1; j < stop; j++) {
            double dx = input[j][0] - input[i][0];
            dx *= dx;
            double dy = input[j][1] - input[i][1];
            dy *= dy;
            double temp = dx + dy;
            if(temp < minimum)
            {
                #pragma omp critical
                if(temp < minimum) minimum = temp;
            }
        }
    }
    return sqrt(minimum);
}

Leading to the final performance:

Windows x64 with icc 14.0

0.0559233
132.7 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)

0.0559233
120.1 ms
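For completeness, a minimal usage sketch of the final function on a hypothetical four-point dataset (when compiled without OpenMP, the pragmas are ignored and the loops simply run serially):

```cpp
#include <cmath>
#include <limits>

// The final version from above: squared distances in the loop,
// inner loop starting at i + 1, and a single sqrt at the end.
double calcDistance(double input[][2], int start, int stop) {
    double minimum = std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for (int i = start; i < stop; i++) {
        for (int j = i + 1; j < stop; j++) {
            double dx = input[j][0] - input[i][0];
            double dy = input[j][1] - input[i][1];
            double temp = dx * dx + dy * dy;
            if (temp < minimum) {
                #pragma omp critical
                if (temp < minimum) minimum = temp;
            }
        }
    }
    return std::sqrt(minimum);
}
```

Calling calcDistance(points, 0, 4) on the points {{0,0}, {5,0}, {5,1}, {-3,4}} returns 1, the distance between (5,0) and (5,1).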
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow