The OpenACC runtime library provides routines (acc_set_device_num(), acc_get_device_num()) to select and query which accelerator device will be targeted by a particular thread, but it's not convenient to drive multiple devices simultaneously from a single thread. Instead, either OpenMP or MPI can be used.
For example (lifting from here) a basic framework for OpenMP might be:
#include <openacc.h>
#include <omp.h>
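/* Inside a function: one host thread per device (two devices assumed) */
#pragma omp parallel num_threads(2)
{
    int i = omp_get_thread_num();   /* thread id doubles as device number */
    acc_set_device_num( i, acc_device_nvidia );
    #pragma acc data copy...
    {
        /* this thread's compute regions go here */
    }
}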
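To make the framework a bit more concrete, here is a minimal sketch that scales an array by splitting it into one chunk per device. The function name scale_multi_device and the chunking scheme are my own illustration, not part of OpenACC or of the code above. The _OPENACC and _OPENMP guards are only there so the same file still builds (and runs on the host) with a compiler that supports neither.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: scale x[0..n) by a, one chunk per device. */
void scale_multi_device(float *x, int n, float a)
{
    int ndev = 0;
#ifdef _OPENACC
    ndev = acc_get_num_devices(acc_device_nvidia);
#endif
    /* Fall back to a single host thread when no device is found. */
    int nthreads = ndev > 0 ? ndev : 1;

    #pragma omp parallel num_threads(nthreads)
    {
        int tid = 0, nthr = 1;
#ifdef _OPENMP
        tid  = omp_get_thread_num();
        nthr = omp_get_num_threads();   /* may be fewer than requested */
#endif
#ifdef _OPENACC
        /* Bind this host thread to its own device. */
        if (ndev > 0)
            acc_set_device_num(tid % ndev, acc_device_nvidia);
#endif
        /* Contiguous chunk [lo, hi) owned by this thread. */
        int lo  = (int)((long)n * tid / nthr);
        int hi  = (int)((long)n * (tid + 1) / nthr);
        int len = hi - lo;
        float *chunk = x + lo;

        /* Each thread copies only its own chunk to and from its device. */
        #pragma acc parallel loop copy(chunk[0:len])
        for (int i = 0; i < len; ++i)
            chunk[i] *= a;
    }
}
```

Calling scale_multi_device(x, n, 2.0f) doubles every element of x. Asking the runtime for the device count with acc_get_num_devices() avoids hard-coding num_threads(2).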
The same can be done with MPI, typically with one rank per device, and MPI can also handle the communication between nodes, as is typical.
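For the MPI route, a rough sketch of one common approach is to have each rank work out its rank within its own node and use that to pick a device. The names device_for_rank and bind_rank_to_device are hypothetical, and USE_MPI is a macro I made up so the file still compiles without MPI (build with something like mpicc -DUSE_MPI plus your OpenACC flags). The round-robin mapping is an assumption; other layouts are possible.

```c
#ifdef USE_MPI                /* hypothetical build macro, not part of MPI */
#include <mpi.h>
#endif
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Round-robin map from node-local rank to device number.
   Returns -1 when there are no devices to choose from. */
int device_for_rank(int local_rank, int num_devices)
{
    if (num_devices <= 0)
        return -1;
    return local_rank % num_devices;
}

#ifdef USE_MPI
/* Call after MPI_Init(). Binds the calling rank to one device on its
   node and returns the device number, or -1 if none was selected. */
int bind_rank_to_device(void)
{
    MPI_Comm node_comm;
    int local_rank = 0;

    /* Group the ranks that share a node, then take the rank within it. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &local_rank);
    MPI_Comm_free(&node_comm);

    int ndev = 0;
#ifdef _OPENACC
    ndev = acc_get_num_devices(acc_device_nvidia);
#endif
    int dev = device_for_rank(local_rank, ndev);
#ifdef _OPENACC
    if (dev >= 0)
        acc_set_device_num(dev, acc_device_nvidia);
#endif
    return dev;
}
#endif
```

Using the node-local rank rather than the global rank keeps the mapping sensible when the job spans several nodes, each with its own set of devices.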