Question

I'm using the CAPS OpenACC compiler. Can I manage device memory myself?

For example, regular OpenACC code targeting CUDA looks like this:

 #pragma acc kernels copyin(a,b) copy(c)
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      for (k = 0; k < SIZE; ++k)
        c[i][j] += a[i][k] * b[k][j];

I want to change it to something like this:

//allocation
cudaMalloc((void**)&a, num_bytes);
cudaMalloc((void**)&b, num_bytes);
cudaMalloc((void**)&c, num_bytes);

//transfer-in (cudaMemcpy takes the destination first, then the source)
cudaMemcpy(a, hostA, num_bytes, cudaMemcpyHostToDevice);
cudaMemcpy(b, hostB, num_bytes, cudaMemcpyHostToDevice);

//computation
// I think the CAPS OpenACC compiler will generate this as a codelet.
#pragma acc kernels
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      for (k = 0; k < SIZE; ++k)
        c[i][j] += a[i][k] * b[k][j];

//transfer-out
cudaMemcpy(hostC, c, num_bytes, cudaMemcpyDeviceToHost);
//cudaFree takes the device pointer itself, not its address
cudaFree(a); cudaFree(b); cudaFree(c);

Solution

Yes, you can allocate the memory yourself. In your example it should be possible to achieve this with the deviceptr clause of the data pragma, so something like:

cudaMalloc((void**)&a, num_bytes);
cudaMalloc((void**)&b, num_bytes);
cudaMalloc((void**)&c, num_bytes);

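// cudaMemcpy takes the destination first, then the source
cudaMemcpy(a, hostA, num_bytes, cudaMemcpyHostToDevice);
cudaMemcpy(b, hostB, num_bytes, cudaMemcpyHostToDevice);
// the loop accumulates into c, so its initial values have to be on the device too
cudaMemcpy(c, hostC, num_bytes, cudaMemcpyHostToDevice);

// a, b and c are already device pointers, so no allocation or transfer is generated
#pragma acc data deviceptr(a, b, c)
#pragma acc kernels
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      for (k = 0; k < SIZE; ++k)
        c[i][j] += a[i][k] * b[k][j];

cudaMemcpy(hostC, c, num_bytes, cudaMemcpyDeviceToHost);
cudaFree(a); cudaFree(b); cudaFree(c);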

[disclaimer: written in browser, never compiled or tested, use at own risk]

This should tell the compiler that a, b and c are pre-existing device allocations. You should also be able to use the OpenACC acc_malloc routine to allocate the memory in place of cudaMalloc, if you so wish.
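As a rough sketch of the acc_malloc variant, something like the following might work. It is untested against the CAPS compiler. The wrappers dev_alloc, dev_free, to_device and from_device and the function matmul_deviceptr are hypothetical names made up for this illustration, and SIZE is fixed at 4 only to keep it short. It also assumes the runtime provides acc_memcpy_to_device and acc_memcpy_from_device, which were added in OpenACC 2.0. When the code is built without an OpenACC compiler, the wrappers fall back to plain host malloc/memcpy and the pragmas are ignored.

```c
#include <stdlib.h>
#include <string.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

#define SIZE 4  /* assumed matrix dimension, for illustration only */

/* Hypothetical wrappers: OpenACC runtime calls when available,
   plain host memory otherwise so the sketch still builds and runs. */
static void *dev_alloc(size_t bytes) {
#ifdef _OPENACC
    return acc_malloc(bytes);
#else
    return malloc(bytes);
#endif
}

static void dev_free(void *p) {
    if (!p) return;
#ifdef _OPENACC
    acc_free(p);
#else
    free(p);
#endif
}

static void to_device(void *dev, void *host, size_t bytes) {
#ifdef _OPENACC
    acc_memcpy_to_device(dev, host, bytes);
#else
    memcpy(dev, host, bytes);
#endif
}

static void from_device(void *host, void *dev, size_t bytes) {
#ifdef _OPENACC
    acc_memcpy_from_device(host, dev, bytes);
#else
    memcpy(host, dev, bytes);
#endif
}

/* Computes hostC += hostA * hostB using buffers we allocate ourselves.
   Returns 0 on success, -1 if an allocation failed. */
int matmul_deviceptr(float hostA[SIZE][SIZE], float hostB[SIZE][SIZE],
                     float hostC[SIZE][SIZE]) {
    size_t num_bytes = sizeof(float) * SIZE * SIZE;
    float (*a)[SIZE] = dev_alloc(num_bytes);
    float (*b)[SIZE] = dev_alloc(num_bytes);
    float (*c)[SIZE] = dev_alloc(num_bytes);
    if (!a || !b || !c) {
        dev_free(a); dev_free(b); dev_free(c);
        return -1;
    }

    to_device(a, hostA, num_bytes);
    to_device(b, hostB, num_bytes);
    to_device(c, hostC, num_bytes);  /* c is accumulated into, so upload it too */

    /* deviceptr: the pointers already refer to device memory */
    #pragma acc data deviceptr(a, b, c)
    #pragma acc kernels
    for (int i = 0; i < SIZE; ++i)
        for (int j = 0; j < SIZE; ++j)
            for (int k = 0; k < SIZE; ++k)
                c[i][j] += a[i][k] * b[k][j];

    from_device(hostC, c, num_bytes);
    dev_free(a); dev_free(b); dev_free(c);
    return 0;
}
```

A caller would fill hostA and hostB, zero hostC (or preload it with values to accumulate onto), and check the return value before reading hostC.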

Thanks to @user2054656 for pointing out my incorrect use of device_resident in the first version of this answer.

Other answers

I agree with talonmies' post, except that you should use the deviceptr clause of the data pragma. I say this because device_resident asks the OpenACC implementation to allocate memory, while deviceptr does not. You don't need the implementation to allocate anything, because the memory has already been allocated by the user with cudaMalloc().
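For contrast, here is a minimal sketch of the other situation, where the OpenACC implementation does the device allocation itself. It uses the create data clause as a stand-in for a runtime-allocating clause (not device_resident, which belongs to the declare directive), and the function name sum_of_squares is hypothetical. It has not been tried with the CAPS compiler; without an OpenACC compiler the pragmas are ignored and it simply runs on the host.

```c
#include <stddef.h>

/* Returns the sum of x[i]*x[i] for i in [0, n).
   tmp is a scratch array of at least n floats.
   With OpenACC enabled, create(tmp[0:n]) makes the runtime allocate a
   device copy of tmp (no transfer in or out); copyin(x[0:n]) makes it
   allocate and upload x. No cudaMalloc or acc_malloc is needed here,
   which is exactly what deviceptr avoids asking for. */
double sum_of_squares(const float *x, float *tmp, int n) {
    double sum = 0.0;

    #pragma acc data copyin(x[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            tmp[i] = x[i] * x[i];

        #pragma acc parallel loop reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += tmp[i];
    }
    return sum;
}
```

If the buffers had come from cudaMalloc instead, the data clauses would be replaced by deviceptr(x, tmp) so that nothing is allocated a second time.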

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow