Question

I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated plate problem), but the GPU-accelerated code is no faster than the CPU, even for large problems.

I also wrote a CUDA version, and that is much faster than both (for 512x512, on the order of 2 seconds compared to 25 for CPU and OpenACC).

Can anyone think of a reason for this discrepancy? I realize that CUDA offers the most potential speed, but OpenACC should give something better than the CPU for larger problems (like the Jacobi solver for the same sort of problem demonstrated here).

Here is the relevant code (the full working source is here):

#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) copy(temp_red[0:size_temp], temp_black[0:size_temp])
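// the coefficient arrays are read-only on the device (copyin), while the
// two temperature arrays are copy so the converged field returns to the host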
// red-black Gauss-Seidel with SOR iteration loop
for (iter = 1; iter <= it_max; ++iter) {
  Real norm_L2 = 0.0;

  // update red cells
  #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
      reduction(+:norm_L2)
  #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
  #pragma acc loop independent gang vector(4)
  for (int col = 1; col < NUM + 1; ++col) {
    #pragma acc loop independent gang vector(64)
    for (int row = 1; row < (NUM / 2) + 1; ++row) {

      int ind_red = col * ((NUM / 2) + 2) + row;        // local (red) index
      int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1);  // global index
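      // red cells fall on alternating global rows, so the column-parity term
      // selects the correct checkerboard offset into the full-size arrays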

      #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

      Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM / 2) + 2)]
                         + aE[ind] * temp_black[row + (col + 1) * ((NUM / 2) + 2)]
                         + aS[ind] * temp_black[row - (col % 2) + col * ((NUM / 2) + 2)]
                         + aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM / 2) + 2)]);

      Real temp_old = temp_red[ind_red];
      temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

      // calculate residual
      res = temp_red[ind_red] - temp_old;
      norm_L2 += (res * res);

    } // end for row
  } // end for col

  // update black cells
  #pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
          reduction(+:norm_L2)
  #pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
  #pragma acc loop independent gang vector(4)
  for (int col = 1; col < NUM + 1; ++col) {
    #pragma acc loop independent gang vector(64)
    for (int row = 1; row < (NUM / 2) + 1; ++row) {

      int ind_black = col * ((NUM / 2) + 2) + row;      // local (black) index
      int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1);    // global index

      #pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])

      Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM / 2) + 2)]
                         + aE[ind] * temp_red[row + (col + 1) * ((NUM / 2) + 2)]
                         + aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM / 2) + 2)]
                         + aN[ind] * temp_red[row + (col % 2) + col * ((NUM / 2) + 2)]);

      Real temp_old = temp_black[ind_black];
      temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);

      // calculate residual
      res = temp_black[ind_black] - temp_old;       
      norm_L2 += (res * res);

    } // end for row
  } // end for col

  // calculate residual
  norm_L2 = sqrt(norm_L2 / ((Real)size));

  if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

  // if tolerance has been reached, end SOR iterations
  if (norm_L2 < tol) {
    break;
  }
}

Solution

Alright, I found a partial solution that reduces the time fairly significantly for smaller problems.

If I insert the lines:

acc_init(acc_device_nvidia);
acc_set_device_num(0, acc_device_nvidia);

before I start my timer, in order to activate and set the GPU, the time for the 512x512 problem drops to 9.8 seconds, and to 42 seconds for 1024x1024 (without these calls, the one-time GPU initialization cost lands inside the timed region). Increasing the problem size further shows how fast even OpenACC can be compared to running on four CPU cores.
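For illustration, here is a minimal sketch of where those calls sit relative to the timer (the solve() stub and the clock()-based timing are stand-ins for the actual solver and timing code in the full source):

#include <stdio.h>
#include <time.h>
#include "openacc.h"

// stand-in for the red-black SOR solver from the question
static void solve(void)
{
  // ... data region and iteration loop ...
}

int main(void)
{
  // activate the GPU before the timer starts, so the one-time device
  // initialization cost is not charged to the solver
  acc_init(acc_device_nvidia);
  acc_set_device_num(0, acc_device_nvidia);

  clock_t start = clock();
  solve();
  double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;

  printf("Total time: %f s\n", elapsed);
  return 0;
}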

With this change, the OpenACC code is on the order of 2x slower than the CUDA code, with the gap narrowing to roughly 1.2x as the problem size grows.

Other tips

I downloaded your full code, compiled it, and ran it! It didn't stop running, and for the instruction

if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);

the result was:

100, nan
200, nan
....

I changed all variables of type Real to type float, and the result was:

100, 0.000654
200, 0.000370
..., ....
..., ....
8800, 0.000002
8900, 0.000002
9000, 0.000001
9100, 0.000001
9200, 0.000001
9300, 0.000001
9400, 0.000001
9500, 0.000001
9600, 0.000001
9700, 0.000001
CPU
Iterations: 9796
Total time: 5.594017 s

With NUM = 1024, the result was:

Iterations: 27271
Total time: 25.949905 s
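For reference, the Real change above presumably comes down to a single typedef in the full source (an assumption about how the code defines it; the actual definition may differ):

// assumed project-wide precision typedef; switching it from double to
// float made the nan residuals shown above disappear on this machine
typedef float Real;   // was: typedef double Real;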
