Several arithmetic operations parallelized in C++ AMP

Question

You're on the right track, but doing in-place manipulations of arrays on a GPU is tricky because you cannot guarantee the order in which different elements are updated.
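
To make the ordering problem concrete, here is a minimal CPU-only sketch in standard C++ (not C++ AMP). The names BlurOutOfPlace and BlurInPlace and the three-point average are hypothetical, chosen only for illustration; the reverseOrder flag stands in for the unspecified order in which GPU threads run.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 1D three-point average, used only to illustrate the idea.
// Reads come from `src` and writes go to `dest`, so the result does not
// depend on the order in which elements are processed.
std::vector<float> BlurOutOfPlace(const std::vector<float>& src)
{
    std::vector<float> dest(src);            // Border elements are copied unchanged.
    for (std::size_t i = 1; i + 1 < src.size(); ++i)
        dest[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0f;
    return dest;
}

// In-place version. `reverseOrder` selects the processing order and stands
// in for the unspecified scheduling of GPU threads.
std::vector<float> BlurInPlace(std::vector<float> data, bool reverseOrder)
{
    const std::size_t n = data.size();
    for (std::size_t step = 1; step + 1 < n; ++step)
    {
        const std::size_t i = reverseOrder ? n - 1 - step : step;
        // Some neighbors read here may already have been overwritten.
        data[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0f;
    }
    return data;
}
```

With the input {0, 0, 9, 0, 0}, BlurOutOfPlace returns {0, 3, 3, 3, 0}, and it would do so in any processing order because it never reads a value it has written. BlurInPlace returns different results for the forward and reverse orders, which is the CPU analogue of the race on the GPU.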

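Here's an example of something very similar to what you are trying to do. The ApplyColorSimplifierTiledHelper method contains an AMP-restricted parallel_for_each that calls SimplifyIndexTiled for each index in the 2D array. The per-pixel function calculates a new value for each pixel in destFrame based on the values of the pixels surrounding the corresponding pixel in srcFrame. Because every read comes from srcFrame and every write goes to destFrame, no thread can see a partially updated neighbor, which avoids the race condition present in your code. Note that the per-pixel function listed here is SimplifyIndex, which takes a plain index<2> rather than a tiled_index and appears to be the untiled counterpart of SimplifyIndexTiled.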

The C++ AMP code in this answer comes from the CodePlex site for the C++ AMP book. The Cartoonizer case study includes several examples of these sorts of image processing problems implemented in C++ AMP using arrays, textures, tiled and untiled kernels, and multiple GPUs. The book discusses the implementation in some detail.

void ApplyColorSimplifierTiledHelper(const array<ArgbPackedPixel, 2>& srcFrame,
    array<ArgbPackedPixel, 2>& destFrame, UINT neighborWindow)
{
    const float_3 W(ImageUtils::W);

    assert(neighborWindow <= FrameProcessorAmp::MaxNeighborWindow);

    tiled_extent<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize>     
        computeDomain = GetTiledExtent(srcFrame.extent);
    parallel_for_each(computeDomain, [=, &srcFrame, &destFrame]
        (tiled_index<FrameProcessorAmp::TileSize, FrameProcessorAmp::TileSize> idx) 
        restrict(amp)
    {
        SimplifyIndexTiled(srcFrame, destFrame, idx, neighborWindow, W);
    });
}

void SimplifyIndex(const array<ArgbPackedPixel, 2>& srcFrame,
                   array<ArgbPackedPixel, 2>& destFrame, index<2> idx,
                   UINT neighborWindow, const float_3& W) restrict(amp)
{
    const int shift = neighborWindow / 2;
    float sum = 0;
    float_3 partialSum;
    const float standardDeviation = 0.025f;
    const float k = -0.5f / (standardDeviation * standardDeviation);

    const int idxY = idx[0] + shift;         // Corrected index for border offset.
    const int idxX = idx[1] + shift;
    const int y_start = idxY - shift;
    const int y_end = idxY + shift;
    const int x_start = idxX - shift;
    const int x_end = idxX + shift;

    RgbPixel orgClr = UnpackPixel(srcFrame(idxY, idxX));

    for (int y = y_start; y <= y_end; ++y)
        for (int x = x_start; x <= x_end; ++x)
        {
            if (x != idxX || y != idxY) // don't apply filter to the requested index, only to the neighbors
            {
                RgbPixel clr = UnpackPixel(srcFrame(y, x));
                float distance = ImageUtils::GetDistance(orgClr, clr, W);
                float value = concurrency::fast_math::pow(float(M_E), k * distance * distance);
                sum += value;
                partialSum.r += clr.r * value;
                partialSum.g += clr.g * value;
                partialSum.b += clr.b * value;
            }
        }

    RgbPixel newClr;
    newClr.r = static_cast<UINT>(clamp(partialSum.r / sum, 0.0f, 255.0f));
    newClr.g = static_cast<UINT>(clamp(partialSum.g / sum, 0.0f, 255.0f));
    newClr.b = static_cast<UINT>(clamp(partialSum.b / sum, 0.0f, 255.0f));
    destFrame(idxY, idxX) = PackPixel(newClr);
}
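
The weighting inside the loop of SimplifyIndex can be tried on the CPU with the following sketch in standard C++. The names SimilarityWeight and WeightedChannelAverage are hypothetical, and the color distance is passed in by the caller because ImageUtils::GetDistance is not shown here. The guard against a zero weight sum is an addition of this sketch, not something the listing above does.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Weight for a neighbor whose color distance from the center pixel is
// `distance`. Computes exp(k * d * d) with k = -0.5 / sigma^2, which is the
// same value as pow(e, k * d * d) in the kernel.
float SimilarityWeight(float distance, float standardDeviation = 0.025f)
{
    const float k = -0.5f / (standardDeviation * standardDeviation);
    return std::exp(k * distance * distance);
}

// Weighted average of one color channel over a list of neighbors.
// `values[i]` is the channel value and `distances[i]` the color distance of
// neighbor i; both vectors are assumed to have the same length.
float WeightedChannelAverage(const std::vector<float>& values,
                             const std::vector<float>& distances)
{
    float sum = 0.0f;
    float partialSum = 0.0f;
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        const float w = SimilarityWeight(distances[i]);
        sum += w;
        partialSum += values[i] * w;
    }
    // Assumption of this sketch: return 0 when every weight underflows to 0,
    // instead of dividing by zero.
    return sum > 0.0f ? partialSum / sum : 0.0f;
}
```

Neighbors whose color is close to the center pixel get a weight near 1, while dissimilar neighbors get a weight near 0 and contribute almost nothing to the average.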

The Cartoonizer code uses ArgbPackedPixel, which is simply a mechanism for packing 8-bit RGB values into an unsigned long, because C++ AMP does not support char. If your problem is small enough to fit into a texture, you may want to use a texture instead of an array: the pack/unpack is implemented in hardware on the GPU, so it is effectively "free", whereas here you have to pay for it with additional compute. There is also an example of the texture implementation on CodePlex.

typedef unsigned long ArgbPackedPixel;

struct RgbPixel 
{
    unsigned int r;
    unsigned int g;
    unsigned int b;
};

const unsigned int fixedAlpha = 0xFF;   // Unsigned so that (fixedAlpha << 24) does not overflow a signed int.

inline ArgbPackedPixel PackPixel(const RgbPixel& rgb) restrict(amp) 
{
    return (rgb.b | (rgb.g << 8) | (rgb.r << 16) | (fixedAlpha << 24));
}


inline RgbPixel UnpackPixel(const ArgbPackedPixel& packedArgb) restrict(amp) 
{
    RgbPixel rgb;
    rgb.b = packedArgb & 0xFF;
    rgb.g = (packedArgb & 0xFF00) >> 8;
    rgb.r = (packedArgb & 0xFF0000) >> 16;
    return rgb;
}
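
As a quick way to check the bit layout on the host, here is a sketch of the same packing scheme in standard C++ without restrict(amp). The names PackedPixel, Rgb, Pack and Unpack are stand-ins introduced for this example, and std::uint32_t is assumed as the storage type. Masking each channel to 8 bits is an addition of this sketch; the listing above relies on the values already being clamped to 0..255.

```cpp
#include <cstdint>

using PackedPixel = std::uint32_t;   // Stand-in for ArgbPackedPixel.

struct Rgb
{
    unsigned int r;
    unsigned int g;
    unsigned int b;
};

// Layout, from the most significant byte down: alpha, red, green, blue.
inline PackedPixel Pack(const Rgb& rgb)
{
    const PackedPixel alpha = 0xFFu;
    // Mask each channel so an out-of-range value cannot spill into the
    // neighboring channel.
    return (rgb.b & 0xFFu) | ((rgb.g & 0xFFu) << 8) |
           ((rgb.r & 0xFFu) << 16) | (alpha << 24);
}

inline Rgb Unpack(PackedPixel packed)
{
    Rgb rgb;
    rgb.b = packed & 0xFFu;
    rgb.g = (packed >> 8) & 0xFFu;
    rgb.r = (packed >> 16) & 0xFFu;
    return rgb;                      // The alpha byte is discarded.
}
```

For example, packing r = 0x12, g = 0x34, b = 0x56 gives 0xFF123456, and unpacking that value returns the original three channels.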