When one has existing code and wants to parallelize incrementally (or just one routine), shared-memory approaches are the "quick hit". Especially when it is known that the iterations are independent, I'd first recommend looking at compiler flags for auto-parallelization, language constructs such as DO CONCURRENT
(thanks to @IanH for reminding me of that), and OpenMP compiler directives.
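For instance, a loop whose iterations are independent might look like the sketch below; the array and the loop body are purely illustrative, and the flags named in the comments are the usual ones for ifort/gfortran:

```fortran
program auto_par_sketch
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), d(n)
  integer :: i

  call random_number(x)

  ! DO CONCURRENT asserts the iterations are independent, so an
  ! auto-parallelizing compiler (ifort -parallel, gfortran
  ! -ftree-parallelize-loops=...) is free to split them across threads.
  do concurrent (i = 1:n)
    d(i) = x(i)**2 + 1.0
  end do

  ! The OpenMP equivalent, enabled with -qopenmp (ifort) or -fopenmp (gfortran).
  !$omp parallel do
  do i = 1, n
    d(i) = x(i)**2 + 1.0
  end do
  !$omp end parallel do

  print *, sum(d)
end program auto_par_sketch
```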
As my extended comment is about distributed memory, however, I'll come to that.
I'll assume you don't have access to some advanced process-spawning setup on all of your potential machines; that is, you'll have processes running on various machines, each being charged for the time regardless of what work is being done. Then the workflow looks like
- Serial outer loop
  - Calculate `D`
  - Distribute `D` to the parallel environment
  - Inner parallel loop on subsets of `D`
  - Gather `D` on the master
  - Calculate
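As a rough sketch, that workflow maps directly onto MPI's scatter/gather collectives. Everything below is illustrative: the problem size, the stand-in arithmetic, and the assumption that `D` divides evenly among the processes:

```fortran
program scatter_gather_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1024            ! illustrative size
  integer :: ierr, rank, nproc, chunk
  real, allocatable :: d(:), d_local(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  chunk = n / nproc                         ! assumes n is divisible by nproc
  allocate(d(n), d_local(chunk))

  ! The serial outer loop would wrap everything below.
  if (rank == 0) d = 1.0                    ! master calculates D

  call MPI_Scatter(d, chunk, MPI_REAL, d_local, chunk, MPI_REAL, &
                   0, MPI_COMM_WORLD, ierr) ! distribute D
  d_local = 2.0*d_local                     ! each process works on its subset
  call MPI_Gather(d_local, chunk, MPI_REAL, d, chunk, MPI_REAL, &
                  0, MPI_COMM_WORLD, ierr)  ! gather D on the master

  if (rank == 0) print *, sum(d)            ! master carries on the calculation
  call MPI_Finalize(ierr)
end program scatter_gather_sketch
```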
If the processors/processes in the parallel environment are doing nothing else - or you're being charged regardless - then this is the same to you as
- Outer loop
  - All processes calculate `D`
  - Each process works on its subset of `D`
  - Synchronize `D`
  - All processes calculate
The communication side here, whether MPI or coarrays (which I'd recommend in this case, again see @IanH's answer), is just in the synchronization; with coarrays, the image synchronization etc. can be as limited as a few loops with [..] references.
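A minimal sketch of that synchronization with coarrays, assuming every image holds a full copy of `D` and the size divides evenly among the images (the work itself is a stand-in):

```fortran
program coarray_sketch
  implicit none
  integer, parameter :: n = 1024        ! illustrative size
  real :: d(n)[*]                       ! every image holds a full copy of D
  integer :: img, me, np, chunk, lo, hi

  me = this_image()
  np = num_images()
  chunk = n / np                        ! assumes n is divisible by np
  lo = (me - 1)*chunk + 1
  hi = me*chunk

  ! The outer loop would wrap everything below.
  d = 1.0                               ! all images calculate D
  d(lo:hi) = 2.0*d(lo:hi)               ! each image works on its own subset

  sync all                              ! make every image's slice visible
  do img = 1, np                        ! the "[..]" loops: pull the other slices
    if (img /= me) then
      d((img-1)*chunk+1 : img*chunk) = d((img-1)*chunk+1 : img*chunk)[img]
    end if
  end do
  sync all                              ! don't overwrite slices others still read
end program coarray_sketch
```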
As an endnote: multi-machine coarray support is very limited. `ifort`, as I understand it, requires a licence beyond the basic one; `g95` has some support; the Cray compiler may well support it too. That's a separate question, however. MPI would be well supported.