Analysis
First, it is pure luck that your program seems to work like this. You do indeed have a data race that causes invalid results on my machine. Consider the following test harness for this post:
::std::cout << ::xtd::target_info() << "\n\n"; // [target os] [target architecture] with [compiler]
static const int count = 30000;
auto gen = ::std::bind(::std::normal_distribution<double>(0, 1000), ::std::mt19937_64(42));
::std::unique_ptr<double[][2]> input(new double[count][2]);
for(size_t i = 0; i < count; ++i)
{
    input[i][0] = gen();
    input[i][1] = gen();
}
::xtd::stopwatch sw; // does what its name suggests
sw.start();
double minimum = calcDistance(input.get(), 0, count);
sw.stop();
::std::cout << minimum << "\n";
::std::cout << sw << "\n";
When executing your function with the omp pragma removed, its output is:
Windows x64 with icc 14.0
0.0559233
7045 ms
or
Windows x64 with msvc VS 2013 (18.00.21005)
0.0559233
7272 ms
When executing it with the omp pragma intact, its output is:
Windows x64 with icc 14.0
0.324085
675.9 ms
or
Windows x64 with msvc VS 2013 (18.00.21005)
0.0559233
4338 ms
Since the machine uses 24 threads (on 12 cores with HT enabled), the speedup is evident, but could possibly be better, at least for msvc. The compiler that generates a faster program (icc) also shows the data race by giving wrong results which are different each run.
Note: I was also able to see an incorrect result from msvc, when compiling a debug version for x86 with 10k iterations.
Proper usage of iteration-local variables
The temp variable in your code has a lifetime of one iteration of the innermost loop. By moving its scope to match its lifetime, we can eliminate one source of data races. I also took the liberty of removing two unused variables and changing the initialization of minimum to a constant:
double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    //#pragma omp parallel for // still broken
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow((input[j][0] - input[i][0]), 2) + pow((input[j][1] - input[i][1]), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j) minimum = temp;
        }
    }
    return minimum;
}
A proper OMP minimum computation
OMP has support for reductions, which will in all probability perform fairly well. To try it, we will use the following pragma, which ensures that each thread works on its own copy of the minimum variable; the per-thread copies are then combined using the minimum operator:
#pragma omp parallel for reduction(min: minimum)
The results validate the approach for ICC:
Windows x64 with icc 14.0
0.0559233
622.1 ms
But MSVC howls error C3036: 'min' : invalid operator token in OpenMP 'reduction' clause, because it does not support minimum reductions. To define our own reduction, we will use a technique called double-checked locking:
double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow((input[j][0] - input[i][0]), 2) + pow((input[j][1] - input[i][1]), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j)
            {
                #pragma omp critical
                if(temp < minimum && i < j) minimum = temp;
            }
        }
    }
    return minimum;
}
This is not only correct, but also leads to comparable performance for MSVC (note that this is significantly faster than the incorrect version!):
Windows x64 with msvc VS 2013 (18.00.21005)
0.0559233
653.1 ms
The performance of ICC does not suffer significantly:
Windows x64 with icc 14.0
0.0559233
636.8 ms
Serial optimizations
While the above is a proper parallelization of your serial code, it can be significantly optimized by considering that you are computing a whole bunch of temp results you are never going to use due to your i < j condition.
By simply changing the start point of the inner loop to i + 1, this wasted computation is avoided entirely, and the loop conditions are simplified.
Another trick is delaying the sqrt computation until the last possible moment: since sqrt is a monotonic transformation, we can compare the squared distances directly and take the root only once, at the end.
Finally, calling pow for a square is fairly inefficient, since it incurs overhead we do not need; two plain multiplications suffice.
This leads to the final code:
double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++) {
        for(int j = i + 1; j < stop; j++) {
            double dx = input[j][0] - input[i][0];
            dx *= dx;
            double dy = input[j][1] - input[i][1];
            dy *= dy;
            double temp = dx + dy;
            if(temp < minimum)
            {
                #pragma omp critical
                if(temp < minimum) minimum = temp;
            }
        }
    }
    return sqrt(minimum);
}
Leading to the final performance:
Windows x64 with icc 14.0
0.0559233
132.7 ms
or
Windows x64 with msvc VS 2013 (18.00.21005)
0.0559233
120.1 ms