Reducing GPU-CPU data transfers in C++Amp

Question

So I think there's something else going on here. Have you tried running the original sample on which your code is based? This is available on CodePlex.

Load the samples solution and build the Reduction project in Release mode and then run it without the debugger attached. You should see some output like this.

Running kernels with 16777216 elements, 65536 KB of data ...
Tile size:     512
Tile count:    128
Using device : NVIDIA GeForce GTX 570

                                                           Total : Calc

SUCCESS: Overhead                                           0.03 : 0.00 (ms)
SUCCESS: CPU sequential                                     9.48 : 9.45 (ms)
SUCCESS: CPU parallel                                       5.92 : 5.89 (ms)
SUCCESS: C++ AMP simple model                              25.34 : 3.19 (ms)
SUCCESS: C++ AMP simple model using array_view             62.09 : 20.61 (ms)
SUCCESS: C++ AMP simple model optimized                    25.24 : 1.81 (ms)
SUCCESS: C++ AMP tiled model                               29.70 : 7.27 (ms)
SUCCESS: C++ AMP tiled model & shared memory               30.40 : 7.56 (ms)
SUCCESS: C++ AMP tiled model & minimized divergence        25.21 : 5.77 (ms)
SUCCESS: C++ AMP tiled model & no bank conflicts           25.52 : 3.92 (ms)
SUCCESS: C++ AMP tiled model & reduced stalled threads     21.25 : 2.03 (ms)
SUCCESS: C++ AMP tiled model & unrolling                   22.94 : 1.55 (ms)
SUCCESS: C++ AMP cascading reduction                       20.17 : 0.92 (ms)
SUCCESS: C++ AMP cascading reduction & unrolling           24.01 : 1.20 (ms)

Note that none of the examples are taking anywhere near the time you code is. Although it's fair to say that the CPU is faster and data copy time is a big contributing factor here.

This is to be expected. Effective use of a GPU involves moving more than operations like reduction to the GPU. You need to move significant amount of compute to make up for the copy overhead.

Some things you should consider:

What happens with you run the sample from CodePlex?
Are you running a release build with optimization enabled?
Are you sure running are running against the actual GPU hardware and not against a WARP (software emulator) accelerator?

Some more information that would be helpful

what hardware are you using?
How large is your data set, both the input data and the size of the partial result array?