See my answer to "will array_view.synchronize_asynch wait for parallel_for_each completion?" for an explanation of why parallel_for_each can be thought of as a queuing or scheduling operation rather than a synchronous one. This explains why your code should satisfy your requirements 1 and 2. It should also meet requirement 3, although you might want to consider having one function that is restrict(cpu, amp), as this will give you less code to maintain.
However, you may want to consider some of the performance implications of your approach.
Firstly, although parallel_for_each only queues work, the data copies between host and GPU memory use host resources (assuming your GPU is discrete and/or does not support direct copy). If your work on the host saturates the resources required to keep the GPU working, then you may actually slow down your GPU calculation.
Secondly, many calculations that are data parallel and amenable to running on a GPU run so much faster there that also running part of the work on the CPU doesn't result in an overall speedup. The overhead includes the first point above and the additional cost of coordinating work on the host (scheduling threads, merging the results, etc.).
Finally, your implementation above does not take into account any variability in the time taken to run tasks on the GPU and CPU. It assumes that 800 AMP tasks will take as long as 200 CPU tasks. This may be true on some hardware but not on others. If one set of tasks takes longer than expected, then your application will block and wait for the slower set of tasks to complete. You can avoid this by using a master/worker pattern, where each worker pulls tasks from a queue until there are no more available tasks. This approach means that in the worst case your application will have to wait for the final task to complete, not a block of tasks. Using the master/worker approach also means that your application will run with equal efficiency regardless of the relative CPU/GPU performance.
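To make the master/worker idea concrete, here is a minimal sketch in standard C++ rather than C++ AMP. The function run_master_worker and its parameters are hypothetical names invented for this illustration, and the shared "queue" is simply an atomic counter over task indices. Each worker claims the next unprocessed task until none remain, so a faster worker naturally ends up doing more of the work.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not part of C++ AMP): runs one thread per worker.
// Each thread repeatedly claims the next task index from a shared counter
// and processes it, stopping once all task_count tasks have been claimed.
// Returns the number of tasks each worker completed.
std::vector<std::size_t> run_master_worker(
    std::size_t task_count,
    const std::vector<std::function<void(std::size_t)>>& workers)
{
    assert(!workers.empty());

    std::atomic<std::size_t> next_task{0};                 // shared work "queue"
    std::vector<std::size_t> completed(workers.size(), 0); // one slot per worker
    std::vector<std::thread> threads;
    threads.reserve(workers.size());

    for (std::size_t w = 0; w < workers.size(); ++w) {
        threads.emplace_back([&, w] {
            for (;;) {
                // fetch_add hands each index to exactly one worker.
                const std::size_t task = next_task.fetch_add(1);
                if (task >= task_count) break;             // no tasks left
                workers[w](task);
                ++completed[w];                            // only this thread writes slot w
            }
        });
    }
    for (auto& t : threads) t.join();
    return completed;
}
```

In your scenario, one worker would presumably wrap a parallel_for_each over a chunk of data and the other a plain CPU loop over a chunk. That mapping is an assumption on my part and is not shown here. Each task would also be a reasonably sized chunk rather than a single element, so that the cost of claiming a task stays small relative to the work it represents.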
My book discusses examples of scheduling work across multiple GPUs using a master/worker pattern (n-body) and a parallel queue (cartoonizer). You can download the source code from CodePlex. Note that it deliberately does not cover sharing work between the CPU and GPU, for the reasons outlined above and based on discussions with the C++ AMP product team.