How to use thrust min_element algorithm without memcpys between device and host

StackOverflow https://stackoverflow.com/questions/21459147

  •  05-10-2022
Question

I am optimising a pycuda / thrust program. In it, I use thrust::min_element to identify the index of the minimum element in an array that is on the device.

Using Nvidia's visual profiler, it appears that whenever I call thrust::min_element, there is a DtoH (device to host) memcpy. What I would like is for everything to be conducted only on the device. In other words, the output of min_element() should be stored on the device, where I can use it later, without suffering the cost of the small DtoH memcpy. Is there a way to do this? Or am I thinking about things the wrong way?

My attempt to do this is below, where the idea is to place the index of the smallest element in the array pointed at by input_ptr into the first element of the array pointed to by output_ptr. Everything should be done on the device, nothing on the host.

This code produces the right answer, but it involves unwanted memcpys. Many thanks in advance for any help you can provide.

#include <thrust/extrema.h>
#include <thrust/device_vector.h>
#include <cuda.h>

void my_min_element(CUdeviceptr input_ptr, int length, CUdeviceptr output_ptr)
{
  thrust::device_ptr<float> i_ptr((float*)input_ptr);
  thrust::device_ptr<int> o_ptr((int*)output_ptr);
  // thrust::min_element returns its result to the host (DtoH copy);
  // the assignment through o_ptr then copies it back to the device (HtoD).
  o_ptr[0] = thrust::distance(i_ptr, thrust::min_element(i_ptr, i_ptr + length));
}

Solution

I have found a (disappointing) answer to my own question:

I found this quote from someone on the CUDA development team [link]:

"I am not a Thrust expert, so take this feedback with a grain of salt; but I think this design element of Thrust deserves to be revisited. Thrust is expressive and useful in ways that sometimes are undermined by the emphasis on returning results to the host. I've had plenty of occasions where I wanted to do an operation strictly in device memory, so Thrust's predisposition toward returning a value to host memory actually got in the way; and if I want results returned to the host, I can always pass in a mapped device pointer (which, if UVA is in effect, means any host pointer that was allocated by CUDA)"

...so it looks like I may be out of luck. If so, what a design flaw in Thrust!
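If staying inside Thrust is not a hard requirement, a small hand-written kernel can keep the index entirely in device memory. Below is a minimal single-block sketch; the kernel name, the block size of 256, and the launch configuration are my own illustrative choices, not from the original answers:

```cuda
#include <cfloat>  // FLT_MAX

// Hypothetical single-block argmin kernel: writes the index of the
// smallest element of d_in into d_out[0], with no DtoH copy.
// Launch as: argmin_kernel<<<1, 256>>>(d_in, length, d_out);
__global__ void argmin_kernel(const float* d_in, int length, int* d_out)
{
    __shared__ float s_val[256];
    __shared__ int   s_idx[256];

    int   tid      = threadIdx.x;
    float best_val = FLT_MAX;
    int   best_idx = 0;

    // Stride over the input; each thread tracks its local minimum.
    for (int i = tid; i < length; i += blockDim.x) {
        if (d_in[i] < best_val) { best_val = d_in[i]; best_idx = i; }
    }
    s_val[tid] = best_val;
    s_idx[tid] = best_idx;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && s_val[tid + s] < s_val[tid]) {
            s_val[tid] = s_val[tid + s];
            s_idx[tid] = s_idx[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) d_out[0] = s_idx[0];
}
```

The result lands in d_out[0] in device memory, so a later kernel (or the original pycuda code) can consume it without any host round trip.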

Other tips

I'm not sure if you are still interested in this, but I believe I have done what you wanted by just casting the CUdeviceptr variable (and telling Thrust to use the device execution policy). Here it is with a reduction, and I believe Thrust doesn't make any extra copies :)

#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <cuda.h>

extern int GPUReduceCudaManage(CUdeviceptr d_data, unsigned int numElements)
{
  thrust::plus<int> binary_op_plus;

  // Cast the raw CUdeviceptr to int* and use the device execution
  // policy so the reduction itself runs entirely on the GPU.
  int result = thrust::reduce(thrust::device,
                              (int*) d_data,
                              (int*) d_data + numElements,
                              0,
                              binary_op_plus);

  return result;
}
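For the original argmin problem specifically, CUB (which ships with recent CUDA toolkits, and underlies Thrust) can write the result directly to device memory via cub::DeviceReduce::ArgMin. A minimal sketch; the helper name device_argmin is my own, and error checking is omitted:

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Hypothetical helper: writes the {index, value} pair of the minimum
// of d_in (a device array of `length` floats) to *d_out, which lives
// in device memory. The result never touches the host.
void device_argmin(const float* d_in, int length,
                   cub::KeyValuePair<int, float>* d_out)
{
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;

    // First call with a null workspace: query required temp storage size.
    cub::DeviceReduce::ArgMin(d_temp, temp_bytes, d_in, d_out, length);
    cudaMalloc(&d_temp, temp_bytes);

    // Second call: perform the reduction; the result is written
    // directly to device memory at d_out.
    cub::DeviceReduce::ArgMin(d_temp, temp_bytes, d_in, d_out, length);

    cudaFree(d_temp);
}
```

Because d_out stays on the device, the index can be consumed by subsequent kernels with no DtoH transfer at all, which is exactly what the question asked for.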
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow