How to use thrust min_element algorithm without memcpys between device and host

StackOverflow https://stackoverflow.com/questions/21459147

  •  05-10-2022
Question

I am optimising a pycuda / thrust program. In it, I use thrust::min_element to identify the index of the minimum element in an array that is on the device.

Using Nvidia's visual profiler, it appears that whenever I call thrust::min_element, there is a DtoH (device to host) memcpy. What I would like is for everything to be conducted only on the device. In other words, the output of min_element() should be stored on the device, where I can use it later, without suffering the cost of the small DtoH memcpy. Is there a way to do this? Or am I thinking about things the wrong way?

My attempt to do this is below, where the idea is to place the index of the smallest element in the array pointed at by input_ptr into the first element of the array pointed to by output_ptr. Everything should be done on the device, nothing on the host.

This code produces the right answer, but it involves unwanted memcpys. Many thanks in advance for any help you can provide.

#include <thrust/extrema.h>
#include <thrust/device_vector.h>
#include <cuda.h>

void my_min_element(CUdeviceptr input_ptr, int length, CUdeviceptr output_ptr)
{
  thrust::device_ptr<float> i_ptr((float*)input_ptr);
  thrust::device_ptr<int> o_ptr((int*)output_ptr);
  // thrust::min_element returns its result to the host (DtoH copy);
  // the assignment through o_ptr then copies it back to the device (HtoD).
  o_ptr[0] = thrust::distance(i_ptr, thrust::min_element(i_ptr, i_ptr + length));
}

Solution

I have found a (disappointing) answer to my own question:

I found this quote from someone on the CUDA development team [link]:

"I am not a Thrust expert, so take this feedback with a grain of salt; but I think this design element of Thrust deserves to be revisited. Thrust is expressive and useful in ways that sometimes are undermined by the emphasis on returning results to the host. I've had plenty of occasions where I wanted to do an operation strictly in device memory, so Thrust's predisposition toward returning a value to host memory actually got in the way; and if I want results returned to the host, I can always pass in a mapped device pointer (which, if UVA is in effect, means any host pointer that was allocated by CUDA)"

...so it looks like I may be out of luck. If so, what a design flaw in Thrust!
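If staying inside Thrust is not a hard requirement, a small hand-written kernel can keep the index entirely in device memory. Below is a minimal single-block sketch; the kernel name, the block size of 256, and the launch configuration are my own illustrative choices, not from the original answers:

```cuda
#include <cfloat>  // FLT_MAX

// Hypothetical single-block argmin kernel: writes the index of the
// smallest element of d_in into d_out[0], with no DtoH copy.
// Launch as: argmin_kernel<<<1, 256>>>(d_in, length, d_out);
__global__ void argmin_kernel(const float* d_in, int length, int* d_out)
{
    __shared__ float s_val[256];
    __shared__ int   s_idx[256];

    int   tid      = threadIdx.x;
    float best_val = FLT_MAX;
    int   best_idx = 0;

    // Stride over the input; each thread tracks its local minimum.
    for (int i = tid; i < length; i += blockDim.x) {
        if (d_in[i] < best_val) { best_val = d_in[i]; best_idx = i; }
    }
    s_val[tid] = best_val;
    s_idx[tid] = best_idx;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && s_val[tid + s] < s_val[tid]) {
            s_val[tid] = s_val[tid + s];
            s_idx[tid] = s_idx[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) d_out[0] = s_idx[0];
}
```

The result lands in d_out[0] in device memory, so a later kernel (or the original pycuda code) can consume it without any host round trip.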

Other tips

I'm not sure if you are still interested in this, but I believe I have done what you wanted by just casting the CUdeviceptr variable (and telling Thrust to use the device execution policy). Here it is with a reduction, and I believe Thrust doesn't make any extra copies :)

#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <cuda.h>

extern int GPUReduceCudaManage(CUdeviceptr d_data, unsigned int numElements)
{
  thrust::plus<int> binary_op_plus;

  // Cast the raw CUdeviceptr to int* and use the device execution
  // policy so the reduction itself runs entirely on the GPU.
  int result = thrust::reduce(thrust::device,
                              (int*) d_data,
                              (int*) d_data + numElements,
                              0,
                              binary_op_plus);

  return result;
}
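For the original argmin problem specifically, CUB (which ships with recent CUDA toolkits, and underlies Thrust) can write the result directly to device memory via cub::DeviceReduce::ArgMin. A minimal sketch; the helper name device_argmin is my own, and error checking is omitted:

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical helper: writes the {index, value} pair of the minimum
// of d_in (a device array of `length` floats) to *d_out, which lives
// in device memory. The result never touches the host.
void device_argmin(const float* d_in, int length,
                   cub::KeyValuePair<int, float>* d_out)
{
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;

    // First call with a null workspace: query required temp storage size.
    cub::DeviceReduce::ArgMin(d_temp, temp_bytes, d_in, d_out, length);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call: perform the reduction; the result is written
    // directly to device memory at d_out.
    cub::DeviceReduce::ArgMin(d_temp, temp_bytes, d_in, d_out, length);

    cudaFree(d_temp);
}
```

Because d_out stays on the device, the index can be consumed by subsequent kernels with no DtoH transfer at all, which is exactly what the question asked for.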
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow