With CUDA SDK 5.5 I can copy data in the following ways:

  • from host: cudaMemcpy(), which uses GPU-DMA if the memory is pinned
  • from host: memcpy() or cudaMemcpy(), which use the CPU cores if the memory isn't pinned
  • from GPU: for() { dst[i] = src[i]; } or memcpy(), which use the GPU cores (see the sketch below)
  • from GPU: to use GPU-DMA ???
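
Here is a minimal sketch of the in-kernel copy I mean (copyKernel is just an illustrative name):

    __global__ void copyKernel(float *dst, const float *src, size_t n)
    {
        // Grid-stride loop: each thread copies its strided elements,
        // so the copy is done by the GPU cores rather than the DMA engine.
        for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += (size_t)gridDim.x * blockDim.x)
        {
            dst[i] = src[i];
        }
    }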

How can I use GPU-DMA in a kernel function of CUDA code to copy data?

Solution

What you are trying to do is not possible from the device side unless your card supports compute capability 3.5. If you have such a card, see the edit below.

Yes, you can access GPU RAM from another device by passing a device pointer allocated on one device to a kernel running on another. The runtime will then fetch the requested data onto the right GPU. However, this isn't very efficient, because every access to another device's memory results in a memcopy operation, either peer-to-peer or device-host-device.
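
A minimal sketch of this pattern, assuming two GPUs (readRemote is an illustrative name; error checking and the cudaDeviceCanAccessPeer query are omitted):

    #include <cuda_runtime.h>

    __global__ void readRemote(const float *remote, float *local, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            local[i] = remote[i];  // each access crosses P2P or goes device-host-device
    }

    int main()
    {
        const size_t n = 1 << 20;
        float *bufOnDev0, *bufOnDev1;

        cudaSetDevice(0);
        cudaMalloc(&bufOnDev0, n * sizeof(float));  // lives on GPU 0

        cudaSetDevice(1);
        cudaMalloc(&bufOnDev1, n * sizeof(float));  // lives on GPU 1
        cudaDeviceEnablePeerAccess(0, 0);           // let GPU 1 access GPU 0's memory

        // Kernel on GPU 1 dereferences a pointer allocated on GPU 0.
        readRemote<<<(n + 255) / 256, 256>>>(bufOnDev0, bufOnDev1, n);
        cudaDeviceSynchronize();

        cudaFree(bufOnDev1);
        cudaSetDevice(0);
        cudaFree(bufOnDev0);
        return 0;
    }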

What you can do is prefetch data from within your host code, using different streams for your memcopy operations (cudaMemcpyAsync) and your kernel executions. However, this only works on a decent card with a separate copy engine, and you have to do the synchronization explicitly, because there is no built-in mechanism that holds your kernel back until the data transfer has finished.
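
A minimal sketch of that overlap, assuming pinned host memory and two chunks (process is a placeholder kernel; the cudaStreamWaitEvent call is the explicit synchronization mentioned above):

    #include <cuda_runtime.h>

    __global__ void process(float *data, size_t n)
    {
        // placeholder for the real computation on one chunk
    }

    int main()
    {
        const size_t n = 1 << 20;
        const int chunks = 2;
        float *h_buf, *d_buf[chunks];
        cudaStream_t copyStream, execStream;
        cudaEvent_t copied[chunks];

        cudaMallocHost(&h_buf, chunks * n * sizeof(float));  // pinned, so the DMA engine is used
        cudaStreamCreate(&copyStream);
        cudaStreamCreate(&execStream);

        for (int c = 0; c < chunks; ++c) {
            cudaMalloc(&d_buf[c], n * sizeof(float));
            cudaEventCreate(&copied[c]);

            // Upload chunk c asynchronously on the copy stream (copy engine).
            cudaMemcpyAsync(d_buf[c], h_buf + c * n, n * sizeof(float),
                            cudaMemcpyHostToDevice, copyStream);
            cudaEventRecord(copied[c], copyStream);

            // Hold the kernel until its chunk has arrived, then run it on the
            // exec stream so it overlaps with the next chunk's upload.
            cudaStreamWaitEvent(execStream, copied[c], 0);
            process<<<(n + 255) / 256, 256, 0, execStream>>>(d_buf[c], n);
        }

        cudaDeviceSynchronize();
        // cleanup of streams, events and buffers omitted for brevity
        return 0;
    }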

EDIT:

If you have a compute capability 3.5 device, you can use the CUDA device runtime to perform device-to-device memcopies from within your device code. See the dynamic parallelism documentation here: http://docs.nvidia.com/cuda/pdf/cuda_dynamic_parallelism_programming_guide.pdf Note that all memcopy operations on the device are also asynchronous, so once again you will have to preserve data coherence on your own.
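
A minimal sketch of such a device-side copy under the CUDA 5.5 device runtime (copyOnDevice is an illustrative name; compile with nvcc -arch=sm_35 -rdc=true and link against cudadevrt):

    #include <cuda_runtime.h>

    __global__ void copyOnDevice(float *dst, const float *src, size_t n)
    {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            // Device-side copy through the CUDA device runtime; like all
            // device-side memcopies it is asynchronous.
            cudaMemcpyAsync(dst, src, n * sizeof(float),
                            cudaMemcpyDeviceToDevice, 0);
            // On CUDA 5.5 a device-side cudaDeviceSynchronize() waits for the
            // copy; without it you must preserve data coherence yourself.
            cudaDeviceSynchronize();
        }
    }

    int main()
    {
        const size_t n = 1 << 20;
        float *src, *dst;
        cudaMalloc(&src, n * sizeof(float));
        cudaMalloc(&dst, n * sizeof(float));
        copyOnDevice<<<1, 32>>>(dst, src, n);
        cudaDeviceSynchronize();
        cudaFree(src);
        cudaFree(dst);
        return 0;
    }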
