Question

I need to update some code I used for the Aho-Corasick algorithm in order to implement the algorithm on the GPU. However, the code relies heavily on an object-oriented programming model. My question is: is it possible to pass objects to parallel_for_each? If not, is there a workable way around this that would spare me from rewriting the entire code from scratch? My apologies if this seems like a naive question. C++ AMP is the first language I have used for GPU programming, so my experience in this field is quite limited.


Solution

The answer to your question is yes, in that you can pass classes or structs to a lambda marked restrict(amp). Note that parallel_for_each itself is not AMP-restricted; its lambda is.

However, you are limited to the types supported by the GPU. This is more a limitation of current GPU hardware than of C++ AMP.

A C++ AMP-compatible function or lambda can only use C++ AMP-compatible types, which include the following:

  • int
  • unsigned int
  • float
  • double
  • C-style arrays of int, unsigned int, float, or double
  • concurrency::array_view or references to concurrency::array
  • structs containing only C++ AMP-compatible types

This means that some data types are forbidden:

  • bool (can be used for local variables in the lambda)
  • char
  • short
  • long long
  • unsigned versions of the above

References and pointers (to a compatible type) may be used locally but cannot be captured by a lambda. Function pointers, pointer-to-pointer, and the like are not allowed; neither are static or global variables. Classes must meet more rules if you wish to use instances of them. They must have no virtual functions or virtual inheritance. Constructors, destructors, and other nonvirtual functions are allowed. The member variables must all be of compatible types, which could of course include instances of other classes as long as those classes meet the same rules.

... From the C++ AMP book, Ch. 3.
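For example, here is a minimal sketch of what the book describes (my own illustration, not code from the book or the answer; the `particle` and `scale_all` names are invented): a struct with only AMP-compatible members is captured by value and used inside a restrict(amp) lambda.

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

// Only float members and no virtual functions, so instances of this
// struct can be captured and used in restrict(amp) code.
struct particle
{
    float mass;
    float velocity;

    // Non-virtual member functions are allowed when marked restrict(amp).
    float momentum() const restrict(amp, cpu) { return mass * velocity; }
};

void scale_all(std::vector<float>& results, particle p)
{
    array_view<float, 1> out(static_cast<int>(results.size()), results);

    // parallel_for_each itself is not restricted; the lambda is.
    // 'p' is captured by value; adding a bool or char member would not compile here.
    parallel_for_each(out.extent, [=](index<1> idx) restrict(amp)
    {
        out[idx] = p.momentum() * (idx[0] + 1);
    });
    out.synchronize();
}
```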

So while you can do this, it may not be the best solution for performance reasons. CPU and GPU caches behave somewhat differently. This makes arrays of structs a better choice for CPU implementations, whereas GPUs often perform better when structs of arrays are used.

GPU hardware is designed to provide the best performance when all threads within a warp are accessing consecutive memory and performing the same operations on that data. Consequently, it should come as no surprise that GPU memory is designed to be most efficient when accessed in this way. In fact, load and store operations to the same transfer line by different threads in a warp are coalesced into as little as a single transaction. The size of a transfer line is hardware-dependent, but in general, your code does not have to account for this if you focus on making memory accesses as contiguous as possible.

... Ch. 7.
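As a rough illustration (again my own sketch, not code from the n-body sample; the names are invented), the two layouts might look like this, with the struct-of-arrays version letting consecutive GPU threads touch consecutive memory:

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

struct body { float pos; float vel; };   // array-of-structs record

// Array-of-structs: natural on the CPU, where each body's data shares a cache line.
void integrate_aos(std::vector<body>& bodies, float dt)
{
    for (auto& b : bodies)
        b.pos += b.vel * dt;
}

// Struct-of-arrays: each field gets its own array_view so that thread i and
// thread i+1 read adjacent elements and loads/stores can be coalesced.
void integrate_soa(std::vector<float>& pos, std::vector<float>& vel, float dt)
{
    const int n = static_cast<int>(pos.size());
    array_view<float, 1> p(n, pos);
    array_view<const float, 1> v(n, vel);

    parallel_for_each(p.extent, [=](index<1> i) restrict(amp)
    {
        p[i] += v[i] * dt;   // consecutive threads -> consecutive addresses
    });
    p.synchronize();
}
```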

If you take a look at the CPU and GPU implementations in my n-body example, you'll see both approaches in use.

The above does not mean that your algorithm will not run faster when you move the implementation to C++ AMP. It just means that you may be leaving some additional performance on the table. I would recommend doing the simplest port possible and then considering whether you want to invest more time optimizing the code, possibly rewriting it to take better advantage of the GPU's architecture.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow