Question

I have the following piece of code that gives the resulting time from multiplying 2 matrix of 1024x1024 fields:

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h>

#define NUM 1024
float a[NUM][NUM],b[NUM][NUM],c[NUM][NUM]; 
void initialize_matrix(float m[][NUM]); 
void load_matrix(float m[][NUM]); 

int main() {

    int i,j,k; 

    clock_t t_inicial,t_final; 
    load_matrix(a); 
    load_matrix(b); 
    initialize_matrix(c); 

    printf("Starting matrix multiplication 1024x1024...\n\n");

    t_inicial=clock();
    for(i=0;i<NUM;i++) 
    for(j=0;j<NUM;j++) 
    for(k=0;k<NUM;k++) 
    c[i][j] =c[i][j] + a[i][k] * b[k][j]; 
    t_final=clock();

    printf("Matrix multiplication finished in: %3.6f seconds",((float) t_final- (float)t_inicial)/CLOCKS_PER_SEC);

} 

void initialize_matrix(float m[][NUM]) {

    int i,j;
    for(i=0;i<NUM;i++) 
    for(j=0;j<NUM;j++) 
    m[i][j]=0.0; 
    return;

} 
void load_matrix(float m[][NUM]) {

    int i,j;
    #pragma omp parallel for
    for(i=0;i<NUM;i++) 
    for(j=0;j<NUM;j++) 
    m[i][j]=(float) 10*rand()/(float) rand(); 
    return;

}

This code takes 24 seconds to solve it, I've been told that there's some problem with the cache memory. How can I improve this code so it takes less time? It's not very nice that it takes so long.

Was it helpful?

Solution

Just exchanging

for(j=0;j<NUM;j++) 
for(k=0;k<NUM;k++) 

with

for(k=0;k<NUM;k++) 
for(j=0;j<NUM;j++) 

I have had a 43x speedup. Like you said, improving cache locality.

Some milliseconds can even be shaven by blocking, i.e., exchanging

for(j=0;j<NUM;j++) 
  for(j=0;j<NUM;j++) 
    for(k=0;k<NUM;k++) 

by

for(int i0=0; i0<NUM; i0+=BLK)
  for(int k0=0; k0<NUM; k0+=BLK)
    for(int j0=0; j0<NUM; j0+=BLK)
      for(int i=i0, ix=i0+BLK; i<ix; ++i)
        for(int k=k0, kx=k0+BLK; k<kx; ++k)
          for(int j=j0, jx=j0+BLK; j<jx; ++j)

(my best runs were with #define BLK 256, but YMMV).

CLARIFYING: this is the answer given the link referenced by @Adriano, and you really should have looked it before editing the question.

OTHER TIPS

It depends upon the compiler that you are using. With GCC you could use __builtin_prefetch, of course after having asked for optimization (compiling with gcc -O3 -mtune=native). Use it carefully after benchmarking, see this answer (e.g. in the i or j loop, to fetch the next row).

Your code is very regular, so could be using OpenMP directives to the compiler. And you could even consider coding some OpenCL kernel to take advantage of your GPGPU.

A straightforward implementation of matrix multiply is not very cache friendly. The comment that you were told is probably referring to blocking, which will do the multiplication in blocks to improve locality. Here is one reference. If you Google for "cache block matrix multiplication" you will get other hits.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top