Defining functions in C++ AMP

Question 1

A function must follow a number of rules to compile successfully with restrict(amp). The first, as mentioned in the parallel_for_each() section, involves the functions that it calls. Those must be visible at code generation time and must also be marked with restrict(amp). If you are not using link time code generation, this essentially means they must be in the same .cpp file at compile time, possibly from a header file included in that .cpp file. If you are using link time code generation (/GL when compiling both .cpp files, the one that calls the function and the one that implements it, and /LTCG when linking), then you can keep the calling and called functions in separate files.

A C++ AMP-compatible function or lambda can only use C++ AMP-compatible types, which include the following:

int
unsigned int
float
double
C-style arrays of int, unsigned int, float, or double
concurrency::array_view or references to concurrency::array
structs containing only C++ AMP-compatible types

This means that some data types are forbidden:

bool (can be used for local variables in the lambda)
char
short
long long
unsigned versions of the above

References and pointers (to a compatible type) may be used locally but cannot be captured by a lambda. Function pointers, pointer-to-pointer, and the like are not allowed; neither are static or global variables.

Classes must meet more rules if you wish to use instances of them. They must have no virtual functions or virtual inheritance. Constructors, destructors, and other nonvirtual functions are allowed. The member variables must all be of compatible types, which could of course include instances of other classes as long as those classes meet the same rules.
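
One of the class rules above, the absence of virtual functions, can be checked at compile time with standard type traits. This is only a sketch in standard C++: the Pixel and Shape types and the HasNoVirtualFunctions helper are hypothetical names chosen for illustration, and std::is_polymorphic is not a C++ AMP facility. It does not check the other rules (compatible member types, no virtual inheritance).

```cpp
#include <type_traits>

// Hypothetical struct: only float and int members, no virtual functions.
struct Pixel
{
    float r, g, b;
    int   flags;

    // Nonvirtual member functions are allowed by the rules above.
    float Luminance() const { return 0.299f * r + 0.587f * g + 0.114f * b; }
};

// Hypothetical counter-example: the virtual functions make it polymorphic.
struct Shape
{
    virtual ~Shape() {}
    virtual float Area() const { return 0.0f; }
};

// True when T neither declares nor inherits a virtual function.
template <typename T>
constexpr bool HasNoVirtualFunctions()
{
    return !std::is_polymorphic<T>::value;
}

// Fails the build early if someone later adds a virtual function to Pixel.
static_assert(HasNoVirtualFunctions<Pixel>(), "Pixel must not be polymorphic");
static_assert(!HasNoVirtualFunctions<Shape>(), "Shape is polymorphic");
```

Placing such a static_assert next to a struct that is passed to a kernel gives a clearer error than the one produced deep inside a restrict(amp) lambda, though the compiler's own restrict(amp) checks remain the authority.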

The code in your AMP-compatible function does not run on a CPU and therefore can't do certain things that you might be used to doing:

recursion
pointer casting
use of virtual functions
new or delete
RTTI or dynamic casting

Here's an example that I think does what you are trying to do, but without tiling. The shift parameter is the radius of the square pixel mask. The example doesn't try to calculate new values for elements that lie within shift of the edge of the array. To avoid wasting threads on those elements, where there is no work to do, parallel_for_each takes an extent that is shift * 2 elements smaller than the array in each dimension. The corrected index, idc, offsets idx by shift so that it refers to the correct element.

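#include <amp.h>
#include <amp_math.h>
#include <numeric>
#include <vector>

using namespace concurrency;

// Declared up front because MatrixSingleGpuExample calls it before its definition.
float WeightedAverage(index<2> idx, const array_view<const float, 2>& data, int shift)
    restrict(amp);

void MatrixSingleGpuExample(const int rows, const int cols, const int shift)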
{
    //  Initialize matrices

    std::vector<float> vA(rows * cols);
    std::vector<float> vC(rows * cols);
    std::iota(vA.begin(), vA.end(), 0.0f);

    //  Calculation (TimeFunc is a timing helper that is not shown here)

    accelerator_view view = accelerator(accelerator::default_accelerator).default_view;
    double time = TimeFunc(view, [&]()
    {
        array_view<const float, 2> a(rows, cols, vA); 
        array_view<float, 2> c(rows, cols, vC);
        c.discard_data();

        extent<2> ext(rows - shift * 2, cols - shift * 2);
        parallel_for_each(view, ext, [=](index<2> idx) restrict(amp)
        {
            index<2> idc(idx[0] + shift, idx[1] + shift);
            c[idc] = WeightedAverage(idc, a, shift);
        });
        c.synchronize();
    });
}

float WeightedAverage(index<2> idx, const array_view<const float, 2>& data, int shift) 
    restrict(amp)
{
    if (idx[1] < shift || idx[1] >= data.extent[1] - shift)
        return 0.0f;
    float max = fast_math::sqrtf((float)(shift * shift * 2));
    float avg = 0.0f;
    float n = 0.0f;
    for (int i = -shift; i <= shift; ++i)
        for (int j = -shift; j <= shift; ++j)
        {
            int row = idx[0] + i;
            int col = idx[1] + j;  // offset the column by j
            // Weight falls from 1 at the center to 0 at the corners of the mask.
            float scale = 1 - fast_math::sqrtf((float)((i * i) + (j * j))) / max;
            avg += data(row,col) * scale;
            n += 1.0f;
        }
    avg /= n;
    return avg;
}
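
To check the GPU output, it can help to compare it against a plain CPU version of the same stencil. The following is a minimal sketch in standard C++ (no amp.h). The names WeightedAverageCpu and FilterCpu and the row-major std::vector layout are assumptions made for illustration, and the weighting uses the distance sqrt(i * i + j * j), which is what the normalization by max appears to intend.

```cpp
#include <cmath>
#include <vector>

// Hypothetical CPU reference for the weighted average; not part of C++ AMP.
// data is a matrix with `cols` columns stored row-major. (row, col) must be
// at least `shift` elements away from every edge, and shift must be >= 1.
float WeightedAverageCpu(const std::vector<float>& data, int cols,
                         int row, int col, int shift)
{
    // Largest distance inside the mask: the corner at (shift, shift).
    const float maxDist = std::sqrt(static_cast<float>(shift * shift * 2));
    float sum = 0.0f;
    float n = 0.0f;
    for (int i = -shift; i <= shift; ++i)
        for (int j = -shift; j <= shift; ++j)
        {
            // Weight falls linearly from 1 at the center to 0 at the corners.
            const float dist = std::sqrt(static_cast<float>(i * i + j * j));
            const float scale = 1.0f - dist / maxDist;
            sum += data[(row + i) * cols + (col + j)] * scale;
            n += 1.0f;
        }
    return sum / n;
}

// Applies the reference to every interior element, mirroring the reduced
// extent (rows - 2 * shift, cols - 2 * shift) and the corrected index.
// Elements closer than `shift` to an edge are left at 0.
std::vector<float> FilterCpu(const std::vector<float>& data,
                             int rows, int cols, int shift)
{
    std::vector<float> out(data.size(), 0.0f);
    for (int r = 0; r < rows - 2 * shift; ++r)
        for (int c = 0; c < cols - 2 * shift; ++c)
        {
            const int row = r + shift;  // corrected index, like idc
            const int col = c + shift;
            out[row * cols + col] = WeightedAverageCpu(data, cols, row, col, shift);
        }
    return out;
}
```

Comparing the result of FilterCpu with the contents of vC after c.synchronize(), using a small tolerance because fast_math trades precision for speed, is one way to catch indexing mistakes in the kernel.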

Question 2

Yes, you need to annotate the function signature with restrict(amp), or with restrict(cpu, amp) if you also want to be able to call the same function from CPU code. See the MSDN docs on restrict.