Question

I have a serial C++ code that simulates a physics algorithm. The code is already serially optimised. What is the standard procedure and documentation to follow when parallelising the code or accelerating it on a GPU? I am targeting an NVIDIA Tesla K40 and an Intel Knights Landing cluster.


Solution

There is, indeed, a standard first-step to this task.

Measure first.

List of performance analysis tools on Wikipedia

This will give you an overview of which computations are the most expensive, which are memory-bandwidth- or latency-bound, and whether they are integer-heavy or floating-point-heavy. Also check whether the code already uses SIMD.


When prioritizing what to parallelize, two preferences will conflict:

  • Optimize the heaviest hotspots first, following the Pareto principle (the 80-20 rule).
  • Optimize the easiest things first, because they are doable without costing much time or effort.

There is no general rule for striking the balance between these two; it depends on the code and your time budget.


Check whether you're allowed to use OpenMP. It is probably the easiest way to add multithreaded parallelization.


Check for, and eliminate, race conditions that would otherwise make the code produce wrong results or crash.


Other tips

I don't know about a "standard" per se, but if you're building an NVIDIA-GPU-based box, look at the CUDA libraries: https://developer.nvidia.com/gpu-accelerated-libraries In particular, you mention a "physics algorithm", which often involves lots of matrix calculations, so https://developer.nvidia.com/arrayfire might be the most helpful.

Rather than taking your existing codebase and parallelizing it piece by piece, you may need to re-analyze your algorithm(s) at a high level in terms of the available GPU-accelerated libraries, and then decompose your problem so as to make the best possible use of those library components.

Licensed under: CC-BY-SA with attribution