With CUDA SDK 5.5 I can copy data in the following ways:

  • from host: cudaMemcpy(), which uses GPU-DMA if the memory is pinned
  • from host: memcpy() or cudaMemcpy(), which use the CPU cores if the memory isn't pinned
  • from GPU: for() { dst[i] = src[i]; } or memcpy(), which use the GPU cores (see the sketch below)
  • from GPU: to use GPU-DMA ???
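
Here is a minimal sketch of the in-kernel copy I mean (copyKernel is just an illustrative name):

    __global__ void copyKernel(float *dst, const float *src, size_t n)
    {
        // Grid-stride loop: each thread copies its strided elements,
        // so the copy is done by the GPU cores rather than the DMA engine.
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += (size_t)gridDim.x * blockDim.x)
        {
            dst[i] = src[i];
        }
    }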

How can I use GPU-DMA in a kernel function of CUDA code to copy data?

Solution

What you are trying to do is not possible from the device side unless your card supports compute capability 3.5. If you have such a card, see the edit below.

Yes, you can access GPU RAM from another device by passing a device pointer allocated on one device to a kernel running on another. The runtime will then fetch the requested data onto the right GPU. However, this isn't very efficient, because every access to another device's memory results in a memcopy operation, either peer-to-peer or device-host-device.
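
A minimal sketch of this pattern, assuming two GPUs (readRemote is an illustrative name; error checking and the cudaDeviceCanAccessPeer query are omitted):

    #include <cuda_runtime.h>

    __global__ void readRemote(const float *remote, float *local, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            local[i] = remote[i];  // each access crosses P2P or goes device-host-device
    }

    int main()
    {
        const size_t n = 1 << 20;
        float *bufOnDev0, *bufOnDev1;

        cudaSetDevice(0);
        cudaMalloc(&bufOnDev0, n * sizeof(float));  // lives on GPU 0

        cudaSetDevice(1);
        cudaMalloc(&bufOnDev1, n * sizeof(float));  // lives on GPU 1
        cudaDeviceEnablePeerAccess(0, 0);           // let GPU 1 access GPU 0's memory

        // Kernel on GPU 1 dereferences a pointer allocated on GPU 0.
        readRemote<<<(n + 255) / 256, 256>>>(bufOnDev0, bufOnDev1, n);
        cudaDeviceSynchronize();

        cudaFree(bufOnDev1);
        cudaSetDevice(0);
        cudaFree(bufOnDev0);
        return 0;
    }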

What you can do is prefetch data from within your host code, using different streams for your memcopy operations (cudaMemcpyAsync) and your kernel executions. However, this only works on a decent card with a separate copy engine, and you have to do the synchronization explicitly, because there is no built-in mechanism that holds your kernel back until the data transfer has finished.
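
A minimal sketch of that overlap, assuming pinned host memory and two chunks (process is a placeholder kernel; the cudaStreamWaitEvent call is the explicit synchronization mentioned above):

    #include <cuda_runtime.h>

    __global__ void process(float *data, size_t n)
    {
        // placeholder for the real computation on one chunk
    }

    int main()
    {
        const size_t n = 1 << 20;
        const int chunks = 2;
        float *h_buf, *d_buf[chunks];
        cudaStream_t copyStream, execStream;
        cudaEvent_t copied[chunks];

        cudaMallocHost(&h_buf, chunks * n * sizeof(float));  // pinned, so the DMA engine is used
        cudaStreamCreate(&copyStream);
        cudaStreamCreate(&execStream);

        for (int c = 0; c < chunks; ++c) {
            cudaMalloc(&d_buf[c], n * sizeof(float));
            cudaEventCreate(&copied[c]);

            // Upload chunk c asynchronously on the copy stream (copy engine).
            cudaMemcpyAsync(d_buf[c], h_buf + c * n, n * sizeof(float),
                            cudaMemcpyHostToDevice, copyStream);
            cudaEventRecord(copied[c], copyStream);

            // Hold the kernel until its chunk has arrived, then run it on the
            // exec stream so it overlaps with the next chunk's upload.
            cudaStreamWaitEvent(execStream, copied[c], 0);
            process<<<(n + 255) / 256, 256, 0, execStream>>>(d_buf[c], n);
        }

        cudaDeviceSynchronize();
        // cleanup of streams, events and buffers omitted for brevity
        return 0;
    }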

EDIT:

If you have a compute capability 3.5 device, you can use the CUDA device runtime to perform device-to-device memcopies from within your device code. See the dynamic parallelism documentation here: http://docs.nvidia.com/cuda/pdf/cuda_dynamic_parallelism_programming_guide.pdf Note that all memcopy operations on the device are also asynchronous, so once again you will have to preserve data coherence on your own.
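
A minimal sketch of such a device-side copy under the CUDA 5.5 device runtime (copyOnDevice is an illustrative name; compile with nvcc -arch=sm_35 -rdc=true and link against cudadevrt):

    #include <cuda_runtime.h>

    __global__ void copyOnDevice(float *dst, const float *src, size_t n)
    {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            // Device-side copy through the CUDA device runtime; like all
            // device-side memcopies it is asynchronous.
            cudaMemcpyAsync(dst, src, n * sizeof(float),
                            cudaMemcpyDeviceToDevice, 0);
            // On CUDA 5.5 a device-side cudaDeviceSynchronize() waits for the
            // copy; without it you must preserve data coherence yourself.
            cudaDeviceSynchronize();
        }
    }

    int main()
    {
        const size_t n = 1 << 20;
        float *src, *dst;
        cudaMalloc(&src, n * sizeof(float));
        cudaMalloc(&dst, n * sizeof(float));
        copyOnDevice<<<1, 32>>>(dst, src, n);
        cudaDeviceSynchronize();
        cudaFree(src);
        cudaFree(dst);
        return 0;
    }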
