Question

I am trying to use C++ AMP to calculate the Fibonacci sequence, where each number depends on the previous two, so the code looks like this:

for (int i = 0; i < size; i++){
    a[i] = 0;
    if (i == 0 || i == 1) a[i] = 1;
}
array_view<int, 1> A(size, a);
parallel_for_each(
    A.extent,
    [=](index<1> idx) restrict(amp) {
        if (A[idx] == 0) {
            while (A[idx - 2] == 0);
            while (A[idx - 1] == 0);
            A[idx] = A[idx - 1] + A[idx - 2];
        }
    });

The third thread (idx[0] == 2) will wait at this line:

A[idx] = A[idx - 1] + A[idx - 2];

It never gets set to a non-zero value, so all the threads that follow simply get stuck in their loops.

Is there any way out?


Solution

As the commenters above have pointed out, Fibonacci is one of the worst possible candidates for parallelizing on a GPU. The best you can hope for here is to use synchronization primitives to execute one operation after another, serializing all the threads!

The C++ AMP runtime makes no guarantees about the order in which it schedules thread tiles on the GPU. Suppose it decides to schedule the tiles calculating the top half of the series first? The lower ones will never run and the kernel will hang. There is no equivalent of wait() in C++ AMP, so the threads do not block, which gives the runtime no indication that it should swap out blocked threads/tiles and run others. You are essentially using a while loop as a substitute for wait. That is a bad idea on either a CPU or a GPU, because it introduces a race condition: the same memory is being read and written concurrently.
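
To put that in perspective, a dependency chain like this is handled trivially by an ordinary serial loop on the CPU; a minimal sketch in plain C++ (no AMP involved):

#include <vector>

std::vector<int> fibonacci(int size)
{
    std::vector<int> a(size, 0);
    for (int i = 0; i < size; ++i)
        a[i] = (i < 2) ? 1 : a[i - 1] + a[i - 2]; // Each value depends only on the two before it.
    return a;
}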

However, your question is a valid one in the context of general thread synchronization. With C++ AMP you basically have two mechanisms: barriers and atomic operations.

The following example shows the use of an atomic to count the number of random numbers in the theData array that are larger than 0.999. The array is initialized with random numbers between 0.0 and 1.0, so this count is quite small. This means that the cost of the atomic operation is incurred only infrequently.

// theData is assumed to be a std::vector<float> filled with random values in the range [0.0, 1.0).
array_view<float, 1> theDataView(int(theData.size()), theData);
int exceptionalOccurrences = 0;
array_view<int> count(1, &exceptionalOccurrences);
parallel_for_each(theDataView.extent, [=] (index<1> idx) restrict(amp)
{
    if (theDataView[idx] >= 0.999f) // Exceptional occurrence.
    {
        atomic_fetch_inc(&count(0));
    }
    // ... update theDataView[idx] here ...
});
count.synchronize();

While this shows how to use an atomic, there is a more efficient solution to this problem: use a (map and) reduction operation to avoid atomics completely.
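
For example, here is a minimal sketch of that approach, assuming the same theData vector of floats: the "map" step writes a 0/1 flag per element on the GPU, and the reduction is done on the host for brevity (it could equally be a tiled reduction on the accelerator).

#include <amp.h>
#include <numeric>
#include <vector>
using namespace concurrency;

int countExceptional(const std::vector<float>& theData)
{
    std::vector<int> flags(theData.size(), 0);
    array_view<const float, 1> dataView(int(theData.size()), theData);
    array_view<int, 1> flagsView(int(flags.size()), flags);
    flagsView.discard_data();                                 // The initial zeros never need to be copied to the GPU.

    parallel_for_each(dataView.extent, [=](index<1> idx) restrict(amp)
    {
        flagsView[idx] = (dataView[idx] >= 0.999f) ? 1 : 0;   // Map: flag each exceptional occurrence.
    });

    flagsView.synchronize();                                  // Copy the flags back to the host.
    return std::accumulate(flags.begin(), flags.end(), 0);    // Reduce: sum the flags.
}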

The following example uses a barrier to synchronize memory accesses across the threads within a tile. The code before the barrier uses all the threads to load data into tile_static memory; the barrier then makes every thread wait until all of the data has been loaded before any thread reads it.

// Assumes a compile-time tile size that divides the extent evenly, e.g. static const int tileSize = 10;
// and an accelerator_view named view.
array<float, 2> inData(1000, 1000);
array<float, 2> outData(1000, 1000);

parallel_for_each(view, 
    inData.extent.tile<tileSize, tileSize>(), [=, &inData, &outData]
    (tiled_index<tileSize, tileSize> tidx) restrict(amp)
    { 
        tile_static float localData[tileSize][tileSize];
        localData[tidx.local[1]][tidx.local[0]] = inData[tidx.global];

        tidx.barrier.wait();
        index<2> outIdx(index<2>(tidx.tile_origin[1], 
                        tidx.tile_origin[0]) + tidx.local);
        outData[outIdx] = localData[tidx.local[0]][tidx.local[1]];
    });

So atomics give you the ability to share memory safely across all threads, at the expense of some significant synchronization overhead; they are non-blocking. Barriers allow you to synchronize execution and memory access across the threads within a tile; they are blocking. There is no feature of C++ AMP that combines the two: you cannot block across all the threads on the GPU.

This is a fundamental assumption of the programming model: threads should require minimal synchronization and run with little or no dependency between them. Most of this is covered in the chapters on Optimization and Performance in my C++ AMP book.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow