Your code as posted doesn't compile; below is a fixed version which I think preserves the original intent. If you want to separate the time spent copying data from the compute time, then the simplest thing to do is to use array<> and explicit copies.
int _height, _width;
_height = _width = 3000;
std::vector<int> _main(_height * _width); // host data.
concurrency::extent<2> ext(_height, _width);
// Start timing data copy
concurrency::array<int, 2> GPU_main(ext /* default accelerator */);
concurrency::array<int, 2> GPU_res(ext);
concurrency::array<int, 2> GPU_temp(ext);
concurrency::copy(begin(_main), end(_main), GPU_main);
// Finish timing data copy
int number = 20000;
// Start timing compute
for (int i = 0; i < number; ++i)
{
    concurrency::parallel_for_each(ext,
        [=, &GPU_res, &GPU_main](concurrency::index<2> idx) restrict(amp)
    {
        GPU_res(idx) = GPU_main(idx) + idx[0];
    });
    concurrency::copy(GPU_res, GPU_temp); // Swap arrays on GPU
    concurrency::copy(GPU_main, GPU_res);
    concurrency::copy(GPU_temp, GPU_main);
}
GPU_main.accelerator_view.wait(); // Wait for compute
// Finish timing compute
// Start timing data copy
concurrency::copy(GPU_main, begin(_main));
// Finish timing data copy
Note the wait() call to force the compute to finish. Remember that C++ AMP commands usually just queue work on the GPU; the work is only guaranteed to have executed once you wait for it, either explicitly with wait() or implicitly by calling (for example) synchronize() on an array_view<>. To get a good idea of the timing you should really time the compute and the data copies separately (as shown above). You can find some basic timing code in Timer.h here: http://ampbook.codeplex.com/SourceControl/changeset/view/100791#1983676 There are examples of its use in the same folder.
However, I'm not sure I would really write the code this way unless I wanted to break out the copy and compute times. It is far simpler to use array<> for data that lives purely on the GPU and array_view<> for data that is copied to and from the GPU.
This would look like the code below.
int _height, _width;
_height = _width = 3000;
std::vector<int> _main(_height * _width); // host data.
concurrency::extent<2> ext(_height, _width);
concurrency::array_view<int, 2> _main_av(ext, _main);
concurrency::array<int, 2> GPU_res(ext);
concurrency::array<int, 2> GPU_temp(ext);
// No explicit copy needed; _main_av wraps _main and copies it to the GPU on first use.
int number = 20000;
// Start timing compute and possibly copy
for (int i = 0; i < number; ++i)
{
    concurrency::parallel_for_each(ext,
        [=, &GPU_res](concurrency::index<2> idx) restrict(amp) // array_view is captured by value
    {
        GPU_res(idx) = _main_av(idx) + idx[0];
    });
    concurrency::copy(GPU_res, GPU_temp); // Swap arrays on GPU
    concurrency::copy(_main_av, GPU_res);
    concurrency::copy(GPU_temp, _main_av);
}
_main_av.synchronize(); // Will wait for all work to finish
// Finish timing compute & copy
Now the data that is only required on the GPU is declared as living on the GPU, and the data that needs to be synchronized with the host is declared as such. The result is clearer and shorter code.
You can find out more about this by reading my book on C++ AMP :)