Frage

I'm trying to use shared memory to cache things with OpenACC.

Basically what I'm working on is a matrix multiplication, and what I have is this:

typedef float ff; 

// Multiplies two square row-major matrices a and b, puts the result in c. 
void mmul(const restrict ff* a, 
          const restrict ff* b, 
          restrict ff* c, 
          const int n) { 
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n]) 
{ 

#pragma acc region 
{ 

#pragma acc loop independent vector(16) 
  for (int i = 0; i < n; ++i) { 
#pragma acc loop independent vector(16) 
    for (int j = 0; j < n; ++j) { 
      ff sum = 0; 
      for (int k = 0; k < n; ++k) { 
        sum += a[i + n * k] * b[k + n * j]; 
      } 
      c[i + n * j] = sum; 
    } 
  } 

} 
}
}

What I would like to do is use shared memory to cache tiles of the matrices 'a' and 'b' to use in the computation of 'c', in a similar fashion to what the CUDA mmul algorithm does.

Basically on CUDA I would know the exact size of my blocks, and would be able to:

  • declare a shared memory with the size of the block
  • copy the 'relevant' part of the data to the block
  • use this data

I understand I can use the

#pragma acc cached

directive, and that I can specify block sizes with the vector and gang options, but I'm having some trouble understanding how that's gonna be mapped to the CUDA architecture.

Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cached directive or on how to map some of the power of shared memory from CUDA to OpenACC?

War es hilfreich?

Lösung

If you are using PGI Accelerator Compiler, you can dump out the generated PTX file and see what is going on in underling of execution:

pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx matrixMult.c -o matrixMult

The generated PTX will be stored in the current directory.

EDIT: You may prefer to see the high-level code (CUDA for C or Fortran). So use following -ta=nvidia,cc13,keepptx,keepgpu .

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top