Question

I have a question about the kernels construct. Is it possible that not every line inside a kernels region runs on the GPU?

For example, I have this code:

#pragma acc kernels copy(a[0:n],b[0:n])
    {
        #pragma acc loop
        for (i = 0; i < n; i++)
            a[i] = i+10;   
        a[1] = 10;
        a[3] = 5;
        #pragma acc loop
        for (i = 0; i < n; i++)
            b[i] = i+20;
    }

Also, is the situation the same for the acc parallel construct?

Solution

Quoting the spec on the kernels construct:

The compiler will break the code in the kernels region into a sequence of accelerator kernels. Typically, each loop nest will be a distinct kernel. When the program encounters a kernels construct, it will launch the sequence of kernels in order on the device.

So the sequence

a[1] = 10;
a[3] = 5;

that you have put between the two loops could be executed on the device. The problem is that this code is not in a loop, so to run it on the GPU the OpenACC compiler has to create a "fake" loop with just one iteration, which becomes its own small kernel. Since this is often slower than running the statements directly, some OpenACC compilers prefer to execute such sequential lines on the host, after copying the data back from the device.
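To make this concrete, here is a minimal sketch of the question's region wrapped in a function, with comments marking where a compiler would typically split it. The function name fill_arrays and the assert on n are assumptions added for illustration; they are not in the original post. Where the two scalar assignments actually run is compiler-dependent, so the comments describe the possibilities rather than a guarantee.

```c
#include <assert.h>

/* Hypothetical helper (name not from the original post): wraps the
   question's kernels region in a function. */
void fill_arrays(int *a, int *b, int n)
{
    int i;

    assert(n >= 4);  /* the scalar writes below touch a[1] and a[3] */

    #pragma acc kernels copy(a[0:n], b[0:n])
    {
        /* Loop nest 1: typically becomes its own device kernel. */
        #pragma acc loop
        for (i = 0; i < n; i++)
            a[i] = i + 10;

        /* Not in a loop: the compiler may wrap these statements in a
           single-iteration kernel on the device, or run them on the
           host. Either way they execute after loop nest 1. */
        a[1] = 10;
        a[3] = 5;

        /* Loop nest 2: typically a second, distinct device kernel. */
        #pragma acc loop
        for (i = 0; i < n; i++)
            b[i] = i + 20;
    }
}
```

Whichever placement the compiler picks, the result is the same: a[1] ends up as 10 and a[3] as 5, because the kernels run in order. To see what your compiler decided, check its diagnostic output; for example, the NVIDIA HPC compilers report the generated kernels when invoked with -acc -Minfo=accel.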

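For parallel regions, the answer is simpler: all code is always executed on the device. Note, however, that statements outside a loop directive in a parallel region are executed redundantly by every gang.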
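If you want the two scalar assignments to stay on the device and run exactly once, one option is to split the work into separate compute constructs inside a data region. The following is only a sketch under assumptions: the function name fill_arrays_parallel is hypothetical, and the serial construct requires a compiler that supports OpenACC 2.6 or later.

```c
#include <assert.h>

/* Hypothetical helper (name not from the original post). */
void fill_arrays_parallel(int *a, int *b, int n)
{
    assert(n >= 4);  /* the scalar writes below touch a[1] and a[3] */

    /* Keep both arrays on the device for the whole sequence, so the
       three compute constructs below do not each copy the data. */
    #pragma acc data copy(a[0:n], b[0:n])
    {
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] = i + 10;

        /* Runs on the device, once, after the first loop finishes
           (assumes OpenACC 2.6+ for the serial construct). */
        #pragma acc serial present(a[0:n])
        {
            a[1] = 10;
            a[3] = 5;
        }

        #pragma acc parallel loop present(b[0:n])
        for (int i = 0; i < n; i++)
            b[i] = i + 20;
    }
}
```

Putting the assignments directly between two loops inside a single parallel region would instead have every gang execute them, with no synchronization against gangs still working on the first loop, which is why this sketch uses separate constructs.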

Licensed under: CC-BY-SA with attribution