OpenCV's GPU module provides a CUDA-accelerated matrix reduction API, gpu::reduce(). You can find it here:
http://docs.opencv.org/modules/gpu/doc/matrix_reductions.html#gpu-reduce
If you don't want to pull in an extra third-party library, you could use cuBLAS, which ships with the CUDA Toolkit. In that case, your task can be expressed in MATLAB notation as follows:
result(1:M) = sum(images(1:N*N, 1:M), 1);
which is equivalent to
result(1:M) = ones(1, N*N) * images(1:N*N, 1:M);
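To make the equivalence concrete, here is a plain-Python sketch of the two formulations above (the sizes N and M and the pixel values are illustrative assumptions, not from the question):

```python
# Plain-Python sketch (not CUDA): summing each image's pixels is the same
# as multiplying a row vector of ones by the (N*N x M) image matrix.
N, M = 3, 4  # illustrative: four 3x3 images

# images[p][j] = pixel p of image j, i.e. an (N*N x M) matrix.
images = [[(p + 1) * (j + 1) for j in range(M)] for p in range(N * N)]

# Direct column sums: result(1:M) = sum(images(1:N*N, 1:M), 1)
direct = [sum(images[p][j] for p in range(N * N)) for j in range(M)]

# Ones-vector multiply: result(1:M) = ones(1, N*N) * images(1:N*N, 1:M)
ones = [1] * (N * N)
via_ones = [sum(ones[p] * images[p][j] for p in range(N * N))
            for j in range(M)]

print(direct == via_ones)  # True
```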
This is a matrix-vector multiplication, which can be done efficiently by the BLAS level-2 function cublas<t>gemv() provided by cuBLAS:
http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-gemv
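To show how the problem maps onto gemv, here is a plain-Python emulation of what cublas<t>gemv computes, namely y = alpha * op(A) * x + beta * y. With op(A) = A^T (the transpose op), alpha = 1, beta = 0, and x a vector of ones, y becomes the per-column sums of A, i.e. one sum per image. The matrix values and sizes below are illustrative assumptions; the real call would pass the device pointer of the column-major (N*N x M) matrix.

```python
# Emulation of the gemv semantics y = alpha * A^T * x + beta * y.
def gemv_transposed(A, x, alpha=1.0, beta=0.0, y=None):
    rows, cols = len(A), len(A[0])
    if y is None:
        y = [0.0] * cols
    return [alpha * sum(A[p][j] * x[p] for p in range(rows)) + beta * y[j]
            for j in range(cols)]

N, M = 2, 3                       # illustrative sizes
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0],
     [10.0, 11.0, 12.0]]          # (N*N x M) = (4 x 3) image matrix
x = [1.0] * (N * N)               # plays the role of ones(1, N*N)

print(gemv_transposed(A, x))      # [22.0, 26.0, 30.0] -- one sum per image
```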
On the other hand, solving this with reduce_by_key()
does not require generating an additional array of image indices: Thrust's fancy iterators (e.g. a counting_iterator wrapped in a transform_iterator as the key sequence) are designed for exactly this situation and reduce the global memory bandwidth requirement.
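The fancy-iterator idea can be sketched in plain Python: the key sequence (key = i // (N*N)) is produced lazily by a generator, analogous to a transform_iterator over a counting_iterator in Thrust, so no index array is ever materialized in memory. The pixel values and sizes are made up for illustration.

```python
from itertools import groupby

N, M = 2, 3
# M images of N*N pixels each, flattened into one array.
pixels = [1, 2, 3, 4,  10, 20, 30, 40,  5, 5, 5, 5]

# Keys computed on the fly (never stored), like a Thrust fancy iterator.
keys = (i // (N * N) for i in range(len(pixels)))

# reduce_by_key: sum consecutive values that share the same key.
sums = [sum(v for _, v in grp)
        for _, grp in groupby(zip(keys, pixels), key=lambda kv: kv[0])]

print(sums)  # [10, 100, 20] -- one sum per image
```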
Please refer to this answer for more details.