Question

I want to optimize the code below using OpenMP:

double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;

#define _MSE(m, t) \
val = refData[t] - calData[t];  \
m += val*val; 

#pragma omp parallel
{
    #pragma omp for
    for( i=0; i<(width*height)/2; i++ ) {   // yuv422: 2 pixels at a time
        _MSE(m_u, 0);
        _MSE(m_y, 1);
        _MSE(m_v, 2);
        _MSE(m_y, 3);

        #pragma omp reduction(+:refData) reduction(+:calData)
        refData += 4;
        calData += 4;
        // int id = omp_get_thread_num();
        // printf("Thread %d performed %d iterations of the loop\n", id, i);
    }
}

Any suggestions for optimizing the above code are welcome; currently I get wrong output.


Solution

I think the easiest thing you can do is split the work into 4 threads and have each one calculate the error for one of the UYVY byte positions. Instead of keeping them as separate variables, make the sums an array:

double sqError[4] = {0};                 // one squared-error sum per UYVY byte position
const int numBytes = width * height * 2; // yuv422: 2 bytes per pixel

#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {      // elem 0..3 -> U, Y, V, Y
    for( int i = elem; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        sqError[elem] += (double)(val*val);
    }
}

This way, each thread accumulates into its own array element, so there is no contention for a shared sum.

Maybe it's not the most advanced use of OMP, but you should see a speedup.


After your comment about the performance hit, I did some experiments and found that the performance was indeed worse. I suspect this may be due to cache misses.

You said:

performance hit this time with openMP: Time: 0.040637, with serial: Time: 0.018670

So I reworked it, using a reduction on each variable and a single loop:

    double e0 = 0.0, e1 = 0.0, e2 = 0.0, e3 = 0.0;   // per-channel squared-error sums

    #pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)
    for( int i = 0; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        e0 += (double)(val*val);
        val = refData[i+1] - calData[i+1];
        e1 += (double)(val*val);
        val = refData[i+2] - calData[i+2];
        e2 += (double)(val*val);
        val = refData[i+3] - calData[i+3];
        e3 += (double)(val*val);
    }
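
If you still need the original per-channel sums, they can be recovered from the four partials after the loop. A minimal sketch, assuming the UYVY byte order implied by the macro calls in the question (byte 0 = U, 1 = Y, 2 = V, 3 = Y):

    // recombine the reduction results into the question's per-channel accumulators
    // (assumes UYVY order: e0 = U, e1 and e3 = the two Y samples, e2 = V)
    double m_u = e0;
    double m_y = e1 + e3;
    double m_v = e2;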

With my test case on a 4-core machine, I observed better than a 4-fold improvement over the serial version:

serial:             2025 ms
omp with 2 loops:   6850 ms
omp with reduction: 455  ms

[Edit] On the subject of why the first piece of code performed worse than the non-parallel version, Hristo Iliev said:

Your first piece of code is a terrible example of what false sharing does in multithreaded code. As sqError has only 4 elements of 8 bytes each, it fits in a single cache line (even in half a cache line on modern x86 CPUs). With 4 threads constantly writing to neighbouring elements, this would generate a massive amount of inter-core cache invalidation due to false sharing. One can get around this by using instead a structure like this:

    struct _error { double val; double pad[7]; } sqError[4];

Now each sqError[i].val will be in a separate cache line, hence no false sharing.
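
Applied to the first version, that padding idea might look like the sketch below; the struct name and the assumption of 64-byte cache lines are mine, not from the quoted comment:

    // each element now occupies a full (assumed 64-byte) cache line,
    // so the 4 threads no longer invalidate each other's lines
    struct padded_error { double val; double pad[7]; };
    struct padded_error sqError[4] = { {0}, {0}, {0}, {0} };

    #pragma omp parallel for
    for( int elem = 0; elem < 4; elem++ ) {
        for( int i = elem; i < numBytes; i += 4 ) {
            int val = refData[i] - calData[i];
            sqError[elem].val += (double)(val*val);
        }
    }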

OTHER TIPS

The code looks like it's calculating the MSE, but every iteration adds into the same shared sums (the m argument of the macro). For the parallelism to work properly you need to eliminate that sharing; one approach would be to preallocate an array (of width*height/2 elements, I imagine) just to store the different sums, or ms, and then add up all the sums at the end, as sketched below.
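
Taken literally that is one sum per pixel pair; a more economical version of the same idea is one partial sum per thread, combined serially afterwards. A rough sketch for the Y channel only, reusing numBytes = width*height*2 from the answer above; the partial array, the 64-thread cap, and treating refData/calData as byte pointers are my assumptions:

    #include <omp.h>   /* omp_get_thread_num(), omp_get_num_threads() */

    double partial[64] = {0};   /* one slot per thread; assumes <= 64 threads */
    int nthreads = 1;

    #pragma omp parallel
    {
        double local = 0.0;                       /* private running sum */
        #pragma omp for nowait
        for( int i = 0; i < numBytes; i += 2 ) {  /* Y bytes sit at odd offsets in UYVY */
            int val = refData[i+1] - calData[i+1];
            local += (double)(val*val);
        }
        partial[omp_get_thread_num()] = local;    /* one shared write per thread */
        #pragma omp single
        nthreads = omp_get_num_threads();
    }

    double m_y = 0.0;
    for( int t = 0; t < nthreads; t++ )           /* add up all the sums at the end */
        m_y += partial[t];

Because each thread touches the shared array only once, the false-sharing problem described above is largely avoided.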

Also, test that this is actually faster!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow