Question

I am working on a code which includes a loop with many iterations (~10^6-10^7) where an array (let's say, 'myresult') is being calculated via summation over lots of contributions. In Fortran 90 with OpenMP, this will look something like:

!$omp parallel do
!$omp& reduction(+:myresult)
do i=1,N
 myresult(i) = myresult(i) + [contribution]
enddo
!$omp end parallel do

The code will be run on a system with Intel Xeon Phi coprocessors, and I would of course like to benefit from them if possible. I have tried using MIC offload directives (!dir$ offload target ...) with OpenMP so that the loop runs on the coprocessor alone, but then the host CPU sits idle while it waits for the coprocessor to finish. Ideally, the loop would be divided between the host and the device, so I would like to know whether something like the following is feasible (or whether there is a better approach). In this sketch the host part of the loop runs on only one core (though perhaps with OMP_NUM_THREADS=2?):

!$omp parallel sections
!$omp& reduction(+:myresult)

!$omp section ! parallel calculation on device
!dir$ offload target mic
!$omp parallel do
!$omp& reduction(+:myresult)
(do i=N/2+1,N)
!$omp end parallel do

!$omp section ! serial calculation on host
(do i=1,N/2)

!$omp end parallel sections

Solution

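The general idea is to use an asynchronous offload to the MIC so that the host CPU can keep working in the meantime: the signal clause starts the offload without blocking, and offload_wait later blocks until the offload has finished. Setting aside the details of how to divide the work, this is how it is expressed: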

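module m
! Make these variables available on the coprocessor as well as the host.
!dir$ attributes offload:mic :: myresult, micresult
integer :: myresult(10000)
integer :: result
integer :: micresult
end module

use m
implicit none
integer :: i, N
N = 10000
result = 0
micresult = 0
myresult = 0
! Start the second half on the coprocessor without blocking the host;
! micresult doubles as the signal tag.
!dir$ omp offload target(mic:0) signal(micresult)
!$omp parallel do reduction(+:micresult)
do i=N/2+1,N
 micresult = micresult + myresult(i) + 55
enddo
!$omp end parallel do

! Meanwhile the host works on the first half.
!$omp parallel do reduction(+:result)
do i=1,N/2
 result = result + myresult(i) + 55
enddo
!$omp end parallel do

! Block until the offload has finished, then combine the partial sums.
!dir$ offload_wait target(mic:0) wait(micresult)
result = result + micresult
end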
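
The answer sets aside how to divide the work. One simple option is a tunable split point instead of a fixed N/2. The following is a minimal sketch, not part of the original answer: host_fraction, n_host, host_sum and mic_sum are hypothetical names, and the offload directives are left out so that the program runs on the host alone. In the real code the first loop would carry the !dir$ omp offload ... signal(...) directive.

```fortran
program split_demo
  implicit none
  integer, parameter :: n = 10000
  ! Hypothetical tuning knob: share of the iterations kept on the host.
  real, parameter :: host_fraction = 0.25
  integer :: myresult(n)
  integer :: i, n_host, host_sum, mic_sum

  myresult = 0

  ! Iterations 1..n_host stay on the host, n_host+1..n go to the device.
  n_host = max(0, min(n, nint(host_fraction * n)))

  host_sum = 0
  mic_sum = 0

  ! Device share (this loop would be offloaded asynchronously).
  !$omp parallel do reduction(+:mic_sum)
  do i = n_host + 1, n
     mic_sum = mic_sum + myresult(i) + 55
  end do
  !$omp end parallel do

  ! Host share.
  !$omp parallel do reduction(+:host_sum)
  do i = 1, n_host
     host_sum = host_sum + myresult(i) + 55
  end do
  !$omp end parallel do

  ! Both shares together must cover every iteration exactly once.
  if (host_sum + mic_sum /= 55 * n) error stop "split lost iterations"
  print *, "host iterations:", n_host, " total:", host_sum + mic_sum
end program split_demo
```

A reasonable way to pick host_fraction is to time both loops and adjust it until the host and the coprocessor finish at roughly the same moment. The best value depends on the hardware and on the cost of each contribution.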

OTHER TIPS

Have you considered using MPI symmetric mode instead of offload? In case you haven't, MPI can do what you describe: you start two MPI ranks, one on the host and one on the coprocessor. Each rank performs its part of the loop in parallel using OpenMP.
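
A rough sketch of this approach, assuming an MPI library that provides the Fortran mpi module, and assuming the same source is built once for the host and once for the coprocessor. The program name, the even block split and the variables lo, hi, partial and total are illustrative choices, not part of the original suggestion.

```fortran
program symmetric_demo
  use mpi
  implicit none
  integer, parameter :: n = 10000
  integer :: myresult(n)
  integer :: ierr, rank, nranks, i, lo, hi, partial, total

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  myresult = 0

  ! Even block split; an uneven split would suit ranks of unequal speed.
  lo = rank * n / nranks + 1
  hi = (rank + 1) * n / nranks

  ! Each rank reduces its own block with OpenMP threads.
  partial = 0
  !$omp parallel do reduction(+:partial)
  do i = lo, hi
     partial = partial + myresult(i) + 55
  end do
  !$omp end parallel do

  ! Combine the per-rank partial sums on rank 0.
  call MPI_Reduce(partial, total, 1, MPI_INTEGER, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)

  if (rank == 0) print *, "total =", total
  call MPI_Finalize(ierr)
end program symmetric_demo
```

For an array result such as the question's myresult, the same MPI_Reduce call would pass the array with a count equal to its length, which sums it element by element across ranks.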

Licensed under: CC-BY-SA with attribution