accelerator directives not working
25-05-2021
Question
This is the code for a matrix multiplication:
program ex
  implicit none
  real :: a(256,256),b(256,256),c(256,256),t1,t2
  integer i,j,k,sum
  sum=0
  do j = 1,256
    do i = 1,256
      a(i,j) = 1
      b(i,j) = 1
      c(i,j) = 0.0
    enddo
  enddo
  call cpu_time(t1)
  !$acc region do
  do i=1,256
    do j=1,256
      sum=0
      do k=1,256
        sum=sum+a(i,k)*b(k,j)
        c(i,j)=sum
      end do
    end do
  end do
  !$acc end region
  call cpu_time(t2)
  print*,"cpu time=",t2-t1
  print*,c
end program ex
With the PGI compiler and the accelerator directives, this runs in 75 ms. The same matrix multiplication implemented in CUDA Fortran runs in only 5 ms. Given such a large difference, I doubt that my accelerator directives are working properly.
Solution
I tried to accelerate your program using OpenHMPP, which has very similar accelerator directives. Note that I moved one of your lines, c(i,j)=sum, which was probably in the innermost loop by mistake. I also had to advise the compiler that a reduction is taking place, and I renamed the reduction variable because it shadowed the sum intrinsic function.
The performance is not good because of the overhead of starting the GPU kernel and because of the memory transfers. You need orders of magnitude more work for using the GPU to be profitable.
For example, with 2000 x 2000 matrices the CPU execution time was 41 seconds, but the GPU execution time was only 8 seconds.
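A rough way to see why the larger problem pays off is to compare arithmetic work with data moved. The estimate below is only a sketch under stated assumptions: each multiply-add counts as 2 floating-point operations, reals are 4 bytes, and each of the three arrays crosses the host/device boundary once. The last line uses the 41 s and 8 s timings reported above.

```latex
\begin{align*}
W(N) &= 2N^3 \text{ flops}, \qquad D(N) = 3 \cdot 4N^2 \text{ bytes}, \qquad \frac{W(N)}{D(N)} = \frac{N}{6} \text{ flops per byte} \\
N = 256:\quad W &\approx 3.4 \times 10^{7}, \qquad D \approx 0.79 \text{ MB}, \qquad W/D \approx 43 \\
N = 2000:\quad W &= 1.6 \times 10^{10}, \qquad D = 48 \text{ MB}, \qquad W/D \approx 333 \\
\frac{W(2000)}{W(256)} &\approx 477, \qquad \frac{t_\text{CPU}}{t_\text{GPU}} = \frac{41\ \text{s}}{8\ \text{s}} \approx 5.1
\end{align*}
```

Under these assumptions the 2000 x 2000 case does about 477 times more arithmetic while moving only about 61 times more data, so the cost of starting the kernel and of the transfers is spread over far more work.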
program ex
  implicit none
  real :: a(256,256),b(256,256),c(256,256),t1,t2
  integer i,j,k,sm
  sm=0
  do j = 1,256
    do i = 1,256
      a(i,j) = 1
      b(i,j) = 1
      c(i,j) = 0.0
    enddo
  enddo
  call cpu_time(t1)
  !$hmpp region, target = CUDA
  !$hmppcg gridify, reduce(+:sm)
  do i=1,256
    do j=1,256
      sm=0
      do k=1,256
        sm=sm+a(i,k)*b(k,j)
      end do
      c(i,j)=sm
    end do
  end do
  !$hmpp endregion
  call cpu_time(t2)
  print*,"cpu time=",t2-t1
  print*,sum(c)
end program ex
Edit: it would probably be better not to use reduce(+:sm), but just private(sm).
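The private(sm) idea can be sketched as follows: each (i,j) iteration gets its own accumulator, so no reduction across iterations is needed. Note that this sketch uses OpenACC syntax, not the OpenHMPP or PGI Accelerator directives from the posts above; that substitution, the module and subroutine names (matmul_sketch, matmul_private), and the change of sm from integer to real are all assumptions made for illustration.

```fortran
module matmul_sketch
  implicit none
contains
  ! Compute c = a * b for n x n matrices.
  ! The OpenACC directives are an assumption (the answer used OpenHMPP).
  ! A compiler without OpenACC enabled treats them as comments and runs serially.
  subroutine matmul_private(a, b, c, n)
    integer, intent(in) :: n
    real, intent(in)    :: a(n,n), b(n,n)
    real, intent(out)   :: c(n,n)
    integer :: i, j, k
    real :: sm   ! named sm so it does not shadow the sum intrinsic

    ! Every (i,j) pair owns a private copy of sm.
    !$acc parallel loop collapse(2) private(sm) copyin(a, b) copyout(c)
    do i = 1, n
      do j = 1, n
        sm = 0.0
        ! The k loop runs sequentially inside each (i,j) iteration.
        !$acc loop seq
        do k = 1, n
          sm = sm + a(i,k) * b(k,j)
        end do
        ! Store once, after the inner loop has finished.
        c(i,j) = sm
      end do
    end do
  end subroutine matmul_private
end module matmul_sketch
```

With a and b filled with ones, as in the question, every element of c should come out equal to n.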
Other tips
FYI, the OP also posted this question on the PGI User Forum (http://www.pgroup.com/userforum/viewtopic.php?t=3081). We believe the original issue was the result of pilot error. When we profiled his code using CUDA Prof, the CUDA Fortran kernel execution time was 205 ms versus 344 ms using the PGI Accelerator Model. Also, if I fix his code so that "c(i,j)=sum" is placed outside of the inner "k" loop, the PGI Accelerator Model time drops to 123 ms. It's unclear how he gathered his timings.
Thanks to those who tried to help. - Mat
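On the question of how the timings were gathered: in the programs above, cpu_time brackets the whole accelerated region, so the figure may include host/device copies and any one-time device start-up as well as the kernel. cpu_time also reports processor time rather than elapsed time. A minimal sketch of a wall-clock timer built on the standard system_clock intrinsic is shown below; the module and function names (wall_timer, wall_seconds) are hypothetical.

```fortran
module wall_timer
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
contains
  ! Return the current wall-clock reading in seconds.
  ! Only differences between two readings are meaningful.
  function wall_seconds() result(t)
    double precision :: t
    integer(int64) :: count, rate
    call system_clock(count, rate)
    t = dble(count) / dble(rate)
  end function wall_seconds
end module wall_timer
```

Taking one reading before and one after the region and subtracting gives elapsed time. Running the region once untimed before the measured run is a common way to keep one-time start-up cost out of the number.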