Question

I'm learning the basics of parallel execution of a for loop using OpenMP.

Sadly, my parallel program runs 10x slower than the serial version. What am I doing wrong? Am I missing some barriers?

double **basicMultiply(double **A, double **B, int size) {
   int i, j, k;
   double **res = createMatrix(size);
   omp_set_num_threads(4);
   #pragma omp parallel for private(k)
   for (i = 0; i < size; i++) {
      for (j = 0; j < size; j++) {
         for (k = 0; k < size; k++) {
            res[i][j] += A[i][k] * B[k][j];
         }
      }
   }
   return res;
}

Thank you very much!

Solution

Your problem is due to a race condition on the inner loop variable j: it is shared by default, so all the threads write to the same j. It needs to be made private (k already is in your private clause, and i is private automatically as the parallel loop variable).

For C89 I would do something like this:

#pragma omp parallel
{
    int i, j, k;
    #pragma omp for
    for(i=0; ...

For C++ or C99, use mixed declarations:

#pragma omp parallel for
for(int i=0; ...

This way you don't have to explicitly declare anything as shared or private.
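For reference, here is the question's function rewritten that way — a minimal sketch assuming, as in the original code, that createMatrix() allocates and zero-initializes a size x size matrix:

double **basicMultiply(double **A, double **B, int size) {
   double **res = createMatrix(size);   /* assumed to zero-initialize, as in the question */
   omp_set_num_threads(4);
   #pragma omp parallel for
   for (int i = 0; i < size; i++) {     /* i, j, k declared inside the loops are private */
      for (int j = 0; j < size; j++) {
         for (int k = 0; k < size; k++) {
            res[i][j] += A[i][k] * B[k][j];
         }
      }
   }
   return res;
}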

Some further comments on your code. Your single-threaded code is not cache friendly when it reads B[k][j]: each step of the k loop jumps to the next row of B, i.e. to a new cache line, and by the time one dot product is finished the earlier cache lines have already been evicted. Instead you should take the transpose first and access it as BT[j][k], so the inner loop walks through contiguous memory. Additionally, you have allocated arrays of arrays and not one contiguous 2D array. I fixed your code to use the transpose and a contiguous 2D array.
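To make the allocation point concrete, here is a small sketch (the helper names are illustrative, not from the original code) contrasting the two layouts:

#include <stdlib.h>

/* One contiguous block: all rows adjacent in memory, indexed as M[i*n + j]. */
double *alloc_contiguous(int n) {
    return (double*)malloc(sizeof(double)*n*n);
}

/* Array of row pointers: each row is a separate allocation, so rows may be
   scattered across the heap, which hurts caching and prefetching. */
double **alloc_rows(int n) {
    double **P = (double**)malloc(sizeof(double*)*n);
    for (int i = 0; i < n; i++)
        P[i] = (double*)malloc(sizeof(double)*n);
    return P;
}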

Here are the times I get for size=512.

no transpose, no openmp  0.94s
no transpose, openmp     0.23s
transpose, no openmp     0.27s
transpose, openmp        0.08s

Below is the code (also see http://coliru.stacked-crooked.com/a/ee174916fa035f97)

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* B = A^T for n x n matrices stored in row-major order */
void transpose(double *A, double *B, int n) {
    int i,j;
    for(i=0; i<n; i++) {
        for(j=0; j<n; j++) {
            B[j*n+i] = A[i*n+j];
        }
    }
}

/* C = A*B with a straightforward triple loop; the access B[k*n+j] is strided */
void gemm(double *A, double *B, double *C, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++) { 
        for (j = 0; j < n; j++) {
            double dot  = 0;
            for (k = 0; k < n; k++) {
                dot += A[i*n+k]*B[k*n+j];
            } 
            C[i*n+j] = dot;
        }
    }
}

/* C = A*B with the rows of C distributed across the OpenMP threads */
void gemm_omp(double *A, double *B, double *C, int n)
{
    #pragma omp parallel
    {
        int i, j, k;
        #pragma omp for
        for (i = 0; i < n; i++) { 
            for (j = 0; j < n; j++) {
                double dot  = 0;
                for (k = 0; k < n; k++) {
                    dot += A[i*n+k]*B[k*n+j];
                } 
                C[i*n+j] = dot;
            }
        }

    }
}

/* C = A*B using a transposed copy of B so the inner loop reads contiguous memory */
void gemmT(double *A, double *B, double *C, int n)
{
    int i, j, k;
    double *B2;
    B2 = (double*)malloc(sizeof(double)*n*n);
    transpose(B,B2, n);
    for (i = 0; i < n; i++) { 
        for (j = 0; j < n; j++) {
            double dot  = 0;
            for (k = 0; k < n; k++) {
                dot += A[i*n+k]*B2[j*n+k];
            } 
            C[i*n+j] = dot;
        }
    }
    free(B2);
}

/* OpenMP version of gemmT: transposed copy of B plus rows of C split across threads */
void gemmT_omp(double *A, double *B, double *C, int n)
{
    double *B2;
    B2 = (double*)malloc(sizeof(double)*n*n);
    transpose(B,B2, n);
    #pragma omp parallel
    {
        int i, j, k;
        #pragma omp for
        for (i = 0; i < n; i++) { 
            for (j = 0; j < n; j++) {
                double dot  = 0;
                for (k = 0; k < n; k++) {
                    dot += A[i*n+k]*B2[j*n+k];
                } 
                C[i*n+j] = dot;
            }
        }

    }
    free(B2);
}

int main() {
    int i, n;
    double *A, *B, *C, dtime;

    n=512;
    A = (double*)malloc(sizeof(double)*n*n);
    B = (double*)malloc(sizeof(double)*n*n);
    C = (double*)malloc(sizeof(double)*n*n);
    for(i=0; i<n*n; i++) { A[i] = (double)rand()/RAND_MAX; B[i] = (double)rand()/RAND_MAX; }

    dtime = omp_get_wtime();
    gemm(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemm_omp(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemmT(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    dtime = omp_get_wtime();
    gemmT_omp(A,B,C, n);
    dtime = omp_get_wtime() - dtime;
    printf("%f\n", dtime);

    free(A); free(B); free(C);

    return 0;
}
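To reproduce these timings, the code must be built with optimization and OpenMP enabled; with GCC that is something along the lines of the following (the file name is just a placeholder):

gcc -O3 -fopenmp gemm.c -o gemm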

OTHER TIPS

In addition, "Z boson", I have tested your C code on a laptop with an Intel i5 (2 physical cores, 4 logical ones). Unfortunately, the calculation speed is not very fast. For 2000x2000 random double matrices I obtained the following results (using VS 2010 with OpenMP 2.0):

Compiled for Win64: C = A*B, where A and B are 2000x2000 matrices:

max number of threads = 4
Create random matrices: = 0.303555 s
no transpose, no openmp = 100.539924 s
no transpose, openmp = 47.876084 s
transpose, no openmp = 27.872169 s
transpose, openmp = 15.821010 s

Compiled for Win32: C = A*B, where A and B are 2000x2000 matrices:

max number of threads = 4
Create random matrices: = 0.378804 s
no transpose, no openmp = 98.613992 s
no transpose, openmp = 48.233655 s
transpose, no openmp = 29.590350 s
transpose, openmp = 13.678097 s

Note that for the "Hynek Blaha" code the calculation time on my system is 739.208 s (226.62 s with OpenMP)!

Whereas in Matlab x64:

n = 2000; 
A = rand(n); B = rand(n);

tic
C = A*B;
toc

the calculation time is 0.591440 seconds.

But using the OpenBLAS package I reached 0.377814 seconds (using MinGW with OpenMP 4.0). The Armadillo package provides a simple way (in my opinion) to hook matrix operations up to OpenBLAS (or other similar packages). In this case the code is:

#include <iostream>
#include <armadillo>
using namespace std;
using namespace arma;

int main(){
    int n = 2000;
    int N = 10; // number of repetitions
    wall_clock timer;

    arma_rng::set_seed_random();

    mat A(n, n, fill::randu), B(n, n, fill::randu);

    timer.tic();
    // repeat the multiplication N times so timer.toc()/N below gives an average
    for(int rep = 0; rep < N; rep++){
      mat C = A*B;
    }
    cout << timer.toc()/double(N) << "s" << endl;

    return 0;
}

If the size is small, the overhead of thread creation and synchronization will outweigh any performance gain from the parallel computation.
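As a minimal sketch of one way to guard against that, OpenMP's if() clause only forks a team of threads when the problem is large enough to pay off (the threshold of 64 below is an arbitrary placeholder, not a measured value):

#include <omp.h>

void scale(double *v, double s, int n) {
    /* Run in parallel only when n is large enough to amortize the fork/join cost. */
    #pragma omp parallel for if(n > 64)
    for (int i = 0; i < n; i++)
        v[i] *= s;
}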

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow