When one has existing code and wants to parallelize incrementally (or just one routine), shared-memory approaches are the "quick hit". Especially when it is known that the iterations are independent, I'd first recommend looking at compiler flags for auto-parallelization, language constructs such as DO CONCURRENT
(thanks to @IanH for reminding me of that), and OpenMP compiler directives.
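For instance, a loop whose iterations are independent might look like the sketch below; the array and the loop body are purely illustrative, and the flags named in the comments are the usual ones for ifort/gfortran:

```fortran
program auto_par_sketch
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), d(n)
  integer :: i

  call random_number(x)

  ! DO CONCURRENT asserts the iterations are independent, so an
  ! auto-parallelizing compiler (ifort -parallel, gfortran
  ! -ftree-parallelize-loops=...) is free to split them across threads.
  do concurrent (i = 1:n)
    d(i) = x(i)**2 + 1.0
  end do

  ! The OpenMP equivalent, enabled with -qopenmp (ifort) or -fopenmp (gfortran).
  !$omp parallel do
  do i = 1, n
    d(i) = x(i)**2 + 1.0
  end do
  !$omp end parallel do

  print *, sum(d)
end program auto_par_sketch
```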
As my extended comment is about distributed memory, however, I'll come to that.
I'll assume you don't have access to some advanced process-spawning setup on all of your potential machines; that is, you'll have processes running on various machines, each being charged for the time regardless of what work is being done. Then the workflow looks like
- Serial outer loop
  - Calculate `D`
  - Distribute `D` to the parallel environment
  - Inner parallel loop on subsets of `D`
  - Gather `D` on the master
  - Calculate
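As a rough sketch, that workflow maps directly onto MPI's scatter/gather collectives. Everything below is illustrative: the problem size, the stand-in arithmetic, and the assumption that `D` divides evenly among the processes:

```fortran
program scatter_gather_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1024            ! illustrative size
  integer :: ierr, rank, nproc, chunk
  real, allocatable :: d(:), d_local(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  chunk = n / nproc                         ! assumes n is divisible by nproc
  allocate(d(n), d_local(chunk))

  ! The serial outer loop would wrap everything below.
  if (rank == 0) d = 1.0                    ! master calculates D

  call MPI_Scatter(d, chunk, MPI_REAL, d_local, chunk, MPI_REAL, &
                   0, MPI_COMM_WORLD, ierr) ! distribute D
  d_local = 2.0*d_local                     ! each process works on its subset
  call MPI_Gather(d_local, chunk, MPI_REAL, d, chunk, MPI_REAL, &
                  0, MPI_COMM_WORLD, ierr)  ! gather D on the master

  if (rank == 0) print *, sum(d)            ! master carries on the calculation
  call MPI_Finalize(ierr)
end program scatter_gather_sketch
```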
If the processors/processes in the parallel environment are doing nothing else - or you're being charged regardless - then this is the same to you as
- Outer loop
  - All processes calculate `D`
  - Each process works on its subset of `D`
  - Synchronize `D`
  - All processes calculate
The communication side here, whether MPI or coarrays (which I'd recommend in this case, again see @IanH's answer), is just in the synchronization; with coarrays, the image synchronization etc. can be as limited as a few loops with [..] references.
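A minimal sketch of that synchronization with coarrays, assuming every image holds a full copy of `D` and the size divides evenly among the images (the work itself is a stand-in):

```fortran
program coarray_sketch
  implicit none
  integer, parameter :: n = 1024        ! illustrative size
  real :: d(n)[*]                       ! every image holds a full copy of D
  integer :: img, me, np, chunk, lo, hi

  me = this_image()
  np = num_images()
  chunk = n / np                        ! assumes n is divisible by np
  lo = (me - 1)*chunk + 1
  hi = me*chunk

  ! The outer loop would wrap everything below.
  d = 1.0                               ! all images calculate D
  d(lo:hi) = 2.0*d(lo:hi)               ! each image works on its own subset

  sync all                              ! make every image's slice visible
  do img = 1, np                        ! the "[..]" loops: pull the other slices
    if (img /= me) then
      d((img-1)*chunk+1 : img*chunk) = d((img-1)*chunk+1 : img*chunk)[img]
    end if
  end do
  sync all                              ! don't overwrite slices others still read
end program coarray_sketch
```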
As an endnote: multi-machine coarray support is very limited. `ifort`, as I understand it, requires a licence beyond the basic one; `g95` has some support; the Cray compiler may well support it too. That's a separate question, however. MPI would be well supported.