accelerator directives not working

https://stackoverflow.com/questions/9789005

25-05-2021
Question

This is the code for a matrix multiplication:

 program ex
    implicit none
    real :: a(256,256),b(256,256),c(256,256),t1,t2
    integer i,j,k,sum
    sum=0

    do j = 1,256
      do i = 1,256
        a(i,j) = 1
        b(i,j) = 1
        c(i,j) = 0.0
      enddo
    enddo

    call cpu_time(t1)
    !$acc region do

    do i=1,256
      do j=1,256
        sum=0
        do k=1,256
          sum=sum+a(i,k)*b(k,j)
          c(i,j)=sum
        end do
      end do
    end do
    !$acc end region
    call cpu_time(t2)
    print*,"cpu time=",t2-t1
    print*,c
  end program ex

When I run this with the accelerator directives and the PGI compiler, the execution time is 75 ms. But when I run the same matrix multiplication with a CUDA Fortran implementation, the execution time is only 5 ms. That is a big difference even though I used the accelerator directives, so I doubt that they are working properly.

Solution

I tried to accelerate your program using OpenHMPP, a very similar set of accelerator directives. Note that I moved one of your lines, which was probably placed in the innermost loop by mistake. Also note that I had to tell the compiler that a reduction takes place. I also renamed the reduction variable, because it shadowed the sum intrinsic function.

The performance is not good because of the overhead of launching the GPU kernel and because of the memory transfers. You need orders of magnitude more work for the GPU to pay off.

For example, with 2000 x 2000 matrices the CPU execution time was 41 seconds, but the GPU execution time was only 8 seconds.
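To make the "orders of magnitude more work" point concrete, here is a rough sketch that counts the multiply-adds and the bytes occupied by the three matrices for both sizes. It is only back-of-the-envelope arithmetic, not a measurement; the program name and variable names are made up for illustration, and it assumes the default real is 4 bytes.

```fortran
program work_estimate
  use iso_fortran_env, only: int64
  implicit none
  ! The two matrix sizes mentioned in this answer
  integer(int64), parameter :: sizes(2) = [256_int64, 2000_int64]
  integer(int64) :: n, madds, bytes
  integer :: s

  do s = 1, size(sizes)
    n = sizes(s)
    ! One multiply-add per (i,j,k) triple of the triple loop
    madds = n**3
    ! Three n x n arrays (a, b, c); 4 bytes per element is an assumption
    bytes = 3_int64 * n**2 * 4_int64
    print '(a,i0,a,i0,a,i0)', 'n=', n, '  multiply-adds=', madds, &
          '  bytes=', bytes
  end do
end program work_estimate
```

Going from 256 to 2000 multiplies the arithmetic by roughly 477 (it grows as n**3) but the data by only about 61 (it grows as n**2), so the fixed launch cost and the transfers weigh much less on the larger problem.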

 program ex
    implicit none
    real :: a(256,256),b(256,256),c(256,256),t1,t2
    integer i,j,k,sm

    sm=0
    ! Fill the inputs with ones and clear the result
    do j = 1,256
      do i = 1,256
        a(i,j) = 1
        b(i,j) = 1
        c(i,j) = 0.0
      enddo
    enddo

    call cpu_time(t1)
    !$hmpp region, target = CUDA
    !$hmppcg gridify, reduce(+:sm)
    do i=1,256
      do j=1,256
        sm=0
        do k=1,256
          sm=sm+a(i,k)*b(k,j)
        end do
        ! Store the result once, after the k loop has finished
        c(i,j)=sm
      end do
    end do
    !$hmpp endregion
    call cpu_time(t2)
    print*,"cpu time=",t2-t1
    print*,sum(c)
  end program ex

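Edit: it would probably be better not to use reduce(+:sm), but just private(sm), since sm is reset to zero for every (i,j) pair and is therefore a per-iteration temporary rather than a value accumulated across the gridified loops.

Other tips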

FYI, the OP also posted this question on the PGI User Forum (http://www.pgroup.com/userforum/viewtopic.php?t=3081). We believe the original issue was the result of pilot error. When we profiled his code using CUDA Prof, the CUDA Fortran kernel execution time was 205 ms versus 344 ms using the PGI Accelerator Model. Also, if I fix his code so that "c(i,j)=sum" is placed outside of the inner "k" loop, the PGI Accelerator Model time drops to 123 ms. It's unclear how he gathered his timings.
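Since the way the timings were gathered is in question, one way to cross-check them is to measure elapsed wall-clock time with system_clock around the corrected loop. cpu_time reports processor time, which need not match elapsed time when work is offloaded. The sketch below is a plain CPU version with the accelerator directives left out; the program name and the real accumulator acc are illustrative choices, not the OP's or PGI's code.

```fortran
program time_matmul
  use iso_fortran_env, only: int64
  implicit none
  integer, parameter :: n = 256
  real :: a(n,n), b(n,n), c(n,n), acc
  integer :: i, j, k
  integer(int64) :: t_start, t_end, rate

  a = 1.0
  b = 1.0
  c = 0.0

  ! Wall-clock ticks before the loop; rate is ticks per second
  call system_clock(t_start, rate)
  do i = 1, n
    do j = 1, n
      acc = 0.0
      do k = 1, n
        acc = acc + a(i,k) * b(k,j)
      end do
      ! Assignment placed outside the inner k loop, as suggested above
      c(i,j) = acc
    end do
  end do
  call system_clock(t_end)

  print *, 'wall time (s) =', real(t_end - t_start) / real(rate)
  ! With all-ones inputs every element of c should equal n
  print *, 'checksum      =', sum(c)
end program time_matmul
```

Printing a checksum also gives a quick sanity check that the directive version and the CPU version compute the same result before their timings are compared.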

Thanks to those that tried to help. - Mat

License: CC-BY-SA with attribution

Not affiliated with StackOverflow