Matrix multiplication by vector OpenMP C [duplicate]

Question 1

I essentially already answer this question parallelizing-matrix-times-a-vector-by-columns-and-by-rows-with-openmp.

You have a race condition when you write to results[y]. To fix this, and still parallelize the inner loop, you have to make private versions of results[y], fill them in parallel, and then merge them in a critical section.

In the code below I assume you're using double, replace it with float or int or whatever datatype you're using (note that your inner loop goes over the first index of matrix[i][y] which is cache unfriendly).

#pragma omp parallel num_threads(4)
{
    int y,i;
    double* results_private = (double*)calloc(matrix_size, sizeof(double));
    for(y = 0; y < matrix_size ; y++) {
        #pragma omp for
        for(i = 0; i < matrix_size; i++) {
            results_private[y] += vector[i]*matrix[i][y];   
        }
    }
    #pragma omp critical
    {
        for(y=0; y<matrix_size; y++) results[y] += results_private[y];
    }
    free(results_private);
}

If this is homework assignment and you want to really impress your instructor then it's possible to do the merging without a critical section. See this link to get an idea on what to do fill-histograms-array-reduction-in-parallel-with-openmp-without-using-a-critic though I can't promise it will be faster.

Question 2

I've not done any parallel programming for a while now, nor any Maths for that matter, but don't you want to split the rows of the matrix in parallel, rather than the columns?

What happens if you try this:

double start_time = clock();
#pragma omp parallel private(i) num_threads(4)
{
tid = omp_get_thread_num();
world_size = omp_get_num_threads();
printf("Threads: %d\n",world_size);

#pragma omp parallel for private(y) shared(results, vector, matrix)
for(y = 0; y < matrix_size ; y++){

    for(i = 0; i < matrix_size; i++){
            results[y] = results[y] + vector[i]*matrix[i][y];   
    }
}
}
double end_time = clock();
double result_time = (end_time - start_time) / CLOCKS_PER_SEC;
printf("Time: %f\n", result_time);

Also, are you sure everything's OK compiling and linking with openMP?

Question 3

You have a typical case of cache conflicts.

Consider that a cache line on your CPU is probably 64 bytes long. Having one processor/core write to the first 4 bytes (float) causes that cache line to be invalidated on every other L1/L2 and maybe L3. This is a lot of overhead.

Partition your data better!

 #pragma omp parallel for private(i) shared(results, vector, matrix) schedule(static,16)

should do the trick. Increase the chunksize if this does not help.

Another optimisation is to store the result locally before you flush it down to memory.

Also, this is an OpenMP thing, but you don't need to start a new parallel region for the loop (each mention of parallel starts a new team):

#pragma omp parallel default(none) \
        shared(vector, matrix) \
        firstprivate(matrix_size) \
        num_threads(4)
{
    int i, y;
    #pragma omp for schedule(static,16)
    for(y = 0; y < matrix_size ; y++){
        double result = 0;
        for(i = 0; i < matrix_size; i++){
                results += vector[i]*matrix[i][y];   
        }
        result[y] = result;
    }
}