Question

For some homework I have, I need to implement the multiplication of a matrix by a vector, parallelizing it by rows and by columns. I understand the row version, but I am a little confused by the column version.

Let's say we have the following data:

[Image: matrix times vector]

And the code for the row version:

#pragma omp parallel default(none) shared(i,v2,v1,matrix,tam) private(j)
  {
#pragma omp for
    for (i = 0; i < tam; i++)
      for (j = 0; j < tam; j++){
//        printf("Hebra %d hizo %d,%d\n", omp_get_thread_num(), i, j);
        v2[i] += matrix[i][j] * v1[j];
      }
  }

Here the calculation is done correctly and the result is right.

The column version:

#pragma omp parallel default(none) shared(j,v2,v1,matrix,tam) private(i)
  {
    for (i = 0; i < tam; i++)
#pragma omp for
      for (j = 0; j < tam; j++) {
//            printf("Hebra %d hizo %d,%d\n", omp_get_thread_num(), i, j);
        v2[i] += matrix[i][j] * v1[j];
      }
  }

Here, due to how the parallelization is done, the result varies on each execution depending on which thread executes each column. But something interesting happens (and I would guess it is because of compiler optimizations): if I uncomment the printf, the results are all the same as in the row version and therefore correct. For example:

Thread 0 did 0,0
Thread 2 did 0,2
Thread 1 did 0,1
Thread 2 did 1,2
Thread 1 did 1,1
Thread 0 did 1,0
Thread 2 did 2,2
Thread 1 did 2,1
Thread 0 did 2,0

Matrix:
 2.000000  3.000000  4.000000 
 3.000000  4.000000  5.000000 
 4.000000  5.000000  6.000000 


V2:
20.000000, 26.000000, 32.000000,

That is right, but if I remove the printf:

V2:
18.000000, 11.000000, 28.000000,

What kind of mechanism should I use to get the column version right?

Note: I care more about the explanation than about any code you may post as an answer, because what I really want is to understand what is going wrong in the column version.

EDIT

I've found a way to get rid of the private vector proposed by Z boson in his answer, replacing it with a single scalar variable. Here is the code:

#pragma omp parallel
{
  double sLocal = 0;   /* this thread's partial sum for the current row */
  int i, j;
  for (i = 0; i < tam; i++) {
#pragma omp for
    for (j = 0; j < tam; j++) {
      sLocal += matrix[i][j] * v1[j];
    }
    /* the omp for ends with an implicit barrier, so every thread's
       sLocal is final for row i before the merge below */
#pragma omp critical
    {
      v2[i] += sLocal;   /* merge the partial sums one thread at a time */
      sLocal = 0;
    }
  }
}
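
The same per-row sum can also be expressed with OpenMP's built-in reduction clause instead of the manual critical section. A minimal sketch (my own, assuming the same tam, matrix, v1 and v2 as above; the accumulator must be declared outside the parallel region so the team can reduce into the shared copy):

double s = 0;   /* shared accumulator for one row */
#pragma omp parallel
{
  int i;
  for (i = 0; i < tam; i++) {
#pragma omp for reduction(+:s)
    for (int j = 0; j < tam; j++)
      s += matrix[i][j] * v1[j];
    /* implicit barrier: s now holds the full dot product of row i */
#pragma omp single
    {
      v2[i] += s;
      s = 0;   /* reset for the next row; single also ends with a barrier */
    }
  }
}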

Solution

I don't know exactly what your homework means by parallelizing along rows and columns, but I do know why your code is not working: you have a race condition when you write to v2[i]. In the column version the inner loop is shared out among the threads, so several threads perform the non-atomic read-modify-write v2[i] += ... on the same element at once, and updates get lost. The printf does not fix anything; it just slows the threads down and changes their interleaving, which makes the lost updates far less likely to show up. You can fix the race by making private versions of v2, filling them in parallel, and then merging them with a critical section.

#pragma omp parallel
{
    double v2_private[tam];   /* per-thread partial result */
    int i, j;
    /* a VLA cannot take an initializer, so zero it explicitly */
    for (i = 0; i < tam; i++) v2_private[i] = 0;
    for (i = 0; i < tam; i++) {
        #pragma omp for
        for (j = 0; j < tam; j++) {
            v2_private[i] += matrix[i][j] * v1[j];
        }
    }
    #pragma omp critical
    {
        for (i = 0; i < tam; i++) v2[i] += v2_private[i];
    }
}

I tested this. You can see the results here: http://coliru.stacked-crooked.com/a/5ad4153f9579304d
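
To make that snippet self-contained, here is a small test harness of my own (not part of the original answer). The input vector v1 = (1, 2, 3) is an assumption, chosen so that the question's example matrix reproduces the expected output 20, 26, 32. Compile with gcc -fopenmp:

#include <stdio.h>

#define TAM 3

int main(void)
{
    double matrix[TAM][TAM] = {{2, 3, 4}, {3, 4, 5}, {4, 5, 6}};
    double v1[TAM] = {1, 2, 3};       /* assumed input vector */
    double v2[TAM] = {0};
    int tam = TAM;

    #pragma omp parallel
    {
        double v2_private[TAM] = {0}; /* fixed size, so an initializer works here */
        int i, j;
        for (i = 0; i < tam; i++) {
            #pragma omp for
            for (j = 0; j < tam; j++)
                v2_private[i] += matrix[i][j] * v1[j];
        }
        #pragma omp critical
        for (i = 0; i < tam; i++)
            v2[i] += v2_private[i];
    }

    for (int i = 0; i < tam; i++)
        printf("%f, ", v2[i]);
    printf("\n");                     /* expected: 20.000000, 26.000000, 32.000000, */
    return 0;
}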

Note that I did not explicitly declare anything shared or private; it's not necessary. Some people think you should explicitly scope everything, but I personally think the opposite: by declaring i, j, and v2_private inside the parallel section, they are made private automatically.
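
As a small illustration of those defaults (my own sketch): variables declared outside the parallel block are shared unless you say otherwise, while anything declared inside is private per thread.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int total = 0;                       /* declared outside: shared by default */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();   /* declared inside: private per thread */
        #pragma omp atomic
        total += id;                     /* atomic protects the shared variable */
    }
    printf("sum of thread ids: %d\n", total);
    return 0;
}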

OTHER TIPS

I'd say the row version is more efficient because it needs no private storage per thread and no critical section or mutex for the partial sums. The code is also much simpler:

#pragma omp parallel for
for (int i = 0; i < tam; i++) {
    for (int j = 0; j < tam; j++) {
        v2[i] += matrix[i][j] * v1[j];
    }
}
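
If you want to put a number on that, here is a small timing harness of my own (not from the answer; the size N and the fill values are arbitrary assumptions). Swapping the loop body for the column version lets you compare the two:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 2000

int main(void)
{
    int tam = N;
    /* build an N x N test matrix and input vector with arbitrary values */
    double **matrix = malloc(tam * sizeof *matrix);
    for (int i = 0; i < tam; i++) {
        matrix[i] = malloc(tam * sizeof **matrix);
        for (int j = 0; j < tam; j++)
            matrix[i][j] = i + j;
    }
    double *v1 = malloc(tam * sizeof *v1);
    double *v2 = calloc(tam, sizeof *v2);
    for (int i = 0; i < tam; i++) v1[i] = 1.0;

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < tam; i++)
        for (int j = 0; j < tam; j++)
            v2[i] += matrix[i][j] * v1[j];
    double t1 = omp_get_wtime();

    printf("row version: %f s (v2[0] = %f)\n", t1 - t0, v2[0]);
    return 0;
}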
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow