Your code as posted doesn't compile; below is a fixed version which I think preserves the original intent. If you want to separate the time spent copying data from the compute time, then the simplest thing to do is to use array<> and explicit copies.
int _height, _width;
_height = _width = 3000;
std::vector<int> _main(_height * _width); // host data.
concurrency::extent<2> ext(_height, _width);
// Start timing data copy
concurrency::array<int, 2> GPU_main(ext /* default accelerator */);
concurrency::array<int, 2> GPU_res(ext);
concurrency::array<int, 2> GPU_temp(ext);
concurrency::copy(begin(_main), end(_main), GPU_main);
// Finish timing data copy
int number = 20000;
// Start timing compute
for (int i = 0; i < number; ++i)
{
    concurrency::parallel_for_each(ext,
        [=, &GPU_res, &GPU_main](concurrency::index<2> idx) restrict(amp)
    {
        GPU_res(idx) = GPU_main(idx) + idx[0];
    });
    concurrency::copy(GPU_res, GPU_temp); // Swap arrays on GPU
    concurrency::copy(GPU_main, GPU_res);
    concurrency::copy(GPU_temp, GPU_main);
}
GPU_main.accelerator_view.wait(); // Wait for compute
// Finish timing compute
// Start timing data copy
concurrency::copy(GPU_main, begin(_main));
// Finish timing data copy
Note the wait() call to force the compute to finish. Remember that C++ AMP commands usually just queue work on the GPU; the work is only guaranteed to have executed once you wait for it, either explicitly with wait() or implicitly by calling (for example) synchronize() on an array_view<>. To get a good idea of the timing you should really time the compute and the data copies separately (as shown above). You can find some basic timing code in Timer.h here: http://ampbook.codeplex.com/SourceControl/changeset/view/100791#1983676 There are examples of its use in the same folder.
However, I'm not sure I would really write the code this way unless I wanted to break out the copy and compute times. It is far simpler to use array<> for data that lives purely on the GPU and array_view<> for data that is copied to and from the GPU.
This would look like the code below.
int _height, _width;
_height = _width = 3000;
std::vector<int> _main(_height * _width); // host data.
concurrency::extent<2> ext(_height, _width);
concurrency::array_view<int, 2> _main_av(ext, _main);
concurrency::array<int, 2> GPU_res(ext);
concurrency::array<int, 2> GPU_temp(ext);
// No explicit copy needed; _main_av wraps _main and copies it to the GPU on first use.
int number = 20000;
// Start timing compute and possibly copy
for (int i = 0; i < number; ++i)
{
    concurrency::parallel_for_each(ext,
        [=, &GPU_res](concurrency::index<2> idx) restrict(amp) // array_view is captured by value
    {
        GPU_res(idx) = _main_av(idx) + idx[0];
    });
    concurrency::copy(GPU_res, GPU_temp); // Swap arrays on GPU
    concurrency::copy(_main_av, GPU_res);
    concurrency::copy(GPU_temp, _main_av);
}
_main_av.synchronize(); // Will wait for all work to finish
// Finish timing compute & copy
Now the data that is only required on the GPU is declared as living on the GPU, and the data that needs to be synchronized with the host is declared as such. The result is clearer and shorter code.
You can find out more about this by reading my book on C++ AMP :)