Question

I need to update some code I used for the Aho-Corasick algorithm in order to implement the algorithm on the GPU. However, the code relies heavily on an object-oriented programming model. My question is: is it possible to pass objects to parallel_for_each? If not, is there a workable way around this that would spare me from rewriting the entire code from scratch? My apologies if this seems like a naive question. C++ AMP is the first language I have used for GPU programming, so my experience in this field is quite limited.


Solution

The answer to your question is yes, in that you can pass classes or structs to a lambda marked restrict(amp). Note that parallel_for_each itself is not AMP-restricted; its lambda is.

However, you are limited to the types supported by the GPU. This is more a limitation of current GPU hardware than of C++ AMP.

A C++ AMP-compatible function or lambda can only use C++ AMP-compatible types, which include the following:

  • int
  • unsigned int
  • float
  • double
  • C-style arrays of int, unsigned int, float, or double
  • concurrency::array_view or references to concurrency::array
  • structs containing only C++ AMP-compatible types

This means that some data types are forbidden:

  • bool (can be used for local variables in the lambda)
  • char
  • short
  • long long
  • unsigned versions of the above

References and pointers (to a compatible type) may be used locally but cannot be captured by a lambda. Function pointers, pointer-to-pointer, and the like are not allowed; neither are static or global variables. Classes must meet more rules if you wish to use instances of them. They must have no virtual functions or virtual inheritance. Constructors, destructors, and other nonvirtual functions are allowed. The member variables must all be of compatible types, which could of course include instances of other classes as long as those classes meet the same rules.

... From the C++ AMP book, Ch. 3.
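For example, here is a minimal sketch of what the book describes (my own illustration, not code from the book or the answer; the `particle` and `scale_all` names are invented): a struct with only AMP-compatible members is captured by value and used inside a restrict(amp) lambda.

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

// Only float members and no virtual functions, so instances of this
// struct can be captured and used in restrict(amp) code.
struct particle
{
    float mass;
    float velocity;

    // Non-virtual member functions are allowed when marked restrict(amp).
    float momentum() const restrict(amp, cpu) { return mass * velocity; }
};

void scale_all(std::vector<float>& results, particle p)
{
    array_view<float, 1> out(static_cast<int>(results.size()), results);

    // parallel_for_each itself is not restricted; the lambda is.
    // 'p' is captured by value; adding a bool or char member would not compile here.
    parallel_for_each(out.extent, [=](index<1> idx) restrict(amp)
    {
        out[idx] = p.momentum() * (idx[0] + 1);
    });
    out.synchronize();
}
```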

So while you can do this, it may not be the best solution for performance reasons. CPU and GPU caches behave somewhat differently. This makes arrays of structs a better choice for CPU implementations, whereas GPUs often perform better when structs of arrays are used.

GPU hardware is designed to provide the best performance when all threads within a warp are accessing consecutive memory and performing the same operations on that data. Consequently, it should come as no surprise that GPU memory is designed to be most efficient when accessed in this way. In fact, load and store operations to the same transfer line by different threads in a warp are coalesced into as little as a single transaction. The size of a transfer line is hardware-dependent, but in general, your code does not have to account for this if you focus on making memory accesses as contiguous as possible.

... Ch. 7.
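As a rough illustration (again my own sketch, not code from the n-body sample; the names are invented), the two layouts might look like this, with the struct-of-arrays version letting consecutive GPU threads touch consecutive memory:

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

struct body { float pos; float vel; };   // array-of-structs record

// Array-of-structs: natural on the CPU, where each body's data shares a cache line.
void integrate_aos(std::vector<body>& bodies, float dt)
{
    for (auto& b : bodies)
        b.pos += b.vel * dt;
}

// Struct-of-arrays: each field gets its own array_view so that thread i and
// thread i+1 read adjacent elements and loads/stores can be coalesced.
void integrate_soa(std::vector<float>& pos, std::vector<float>& vel, float dt)
{
    const int n = static_cast<int>(pos.size());
    array_view<float, 1> p(n, pos);
    array_view<const float, 1> v(n, vel);

    parallel_for_each(p.extent, [=](index<1> i) restrict(amp)
    {
        p[i] += v[i] * dt;   // consecutive threads -> consecutive addresses
    });
    p.synchronize();
}
```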

If you take a look at the CPU and GPU implementations in my n-body example, you'll see both approaches in use.

The above does not mean that your algorithm will not run faster when you move the implementation to C++ AMP. It just means that you may be leaving some additional performance on the table. I would recommend doing the simplest port possible and then considering whether you want to invest more time optimizing the code, possibly rewriting it to take better advantage of the GPU's architecture.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow