Does nVidia's RDMA for GPUDirect always operate only on physical addresses (in the physical address space of the CPU)?

StackOverflow https://stackoverflow.com/questions/19841815

Question

As we know: http://en.wikipedia.org/wiki/IOMMU#Advantages

Peripheral memory paging can be supported by an IOMMU. A peripheral using the PCI-SIG PCIe Address Translation Services (ATS) Page Request Interface (PRI) extension can detect and signal the need for memory manager services.


But when we use an nVidia GPU with CUDA >= 5.0, we can use RDMA for GPUDirect, and we know that:

http://docs.nvidia.com/cuda/gpudirect-rdma/index.html#how-gpudirect-rdma-works

Traditionally, resources like BAR windows are mapped to user or kernel address space using the CPU's MMU as memory mapped I/O (MMIO) addresses. However, because current operating systems don't have sufficient mechanisms for exchanging MMIO regions between drivers, the NVIDIA kernel driver exports functions to perform the necessary address translations and mappings.

http://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems

RDMA for GPUDirect currently relies upon all physical addresses being the same from the PCI devices' point of view. This makes it incompatible with IOMMUs and hence they must be disabled for RDMA for GPUDirect to work.

And if we allocate CPU RAM and map it into the UVA, as here:

#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

int main() {
    // Allow pinned host allocations to be mapped into the device address space
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate pinned, mapped host memory
    unsigned char *host_src_ptr = NULL;
    cudaHostAlloc(&host_src_ptr, 1024*1024, cudaHostAllocMapped);
    std::cout << "host_src_ptr = " << (size_t)host_src_ptr << std::endl;

    // Get the device (UVA) pointer for the same allocation
    unsigned char *uva_src_ptr = NULL;
    cudaHostGetDevicePointer(&uva_src_ptr, host_src_ptr, 0);
    std::cout << "uva_src_ptr  = " << (size_t)uva_src_ptr << std::endl;

    int b;  std::cin >> b;
    return 0;
}

We get equal pointers on Windows 7 x64, which means that cudaHostGetDevicePointer() does nothing:

host_src_ptr = 68719476736

uva_src_ptr = 68719476736

What does "sufficient mechanisms for exchanging MMIO regions between drivers" mean, which mechanism is meant here, and why can I not use the IOMMU, i.e. use a virtual address to access, via PCIe, the physical region of the BAR of another memory-mapped PCIe device?

And does this mean that RDMA for GPUDirect always operates only on physical addresses (in the physical address space of the CPU)? If so, why do we pass to the kernel function uva_src_ptr, which is equal to host_src_ptr, a plain pointer in the CPU's virtual address space?


Solution

The IOMMU is very useful in that it provides a set of mapping registers. It can arrange for any physical memory to appear within the address range accessible by a device, and it can make physically scattered buffers look contiguous to devices, too. This is not good for 3rd party PCI/PCI-Express cards or remote machines attempting to access the raw physical offset of an nVidia GPU, as it may result in not actually accessing the intended regions of memory, or in the IOMMU inhibiting/restricting such accesses on a per-card basis. The IOMMU must therefore be disabled, because

"RDMA for GPUDirect currently relies upon all physical addresses being the same from the PCI devices' point of view."

-nVidia, Design Considerations for rDMA and GPUDirect

When drivers attempt to utilize the CPU's MMU and map regions of memory-mapped I/O (MMIO) for use within kernel-space, they typically keep the returned address from the memory mapping to themselves. Because each driver operates within its own context or namespace, exchanging these mappings between nVidia's driver(s) and other 3rd party vendors' drivers that wish to support rDMA+GPUDirect would be very difficult and would result in a vendor-specific solution (possibly even product-specific, if drivers vary greatly between products from the 3rd party). Also, today's operating systems don't have any good solution for exchanging MMIO mappings between drivers, so nVidia exports several functions that allow 3rd party drivers to easily access this information from within kernel-space itself.
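To make "exports several functions" concrete, here is a rough kernel-side sketch (in the C a Linux driver would use) of the kind of interface meant. The nvidia_p2p_* calls come from NVIDIA's nv-p2p.h and are the ones named again in the step-by-step list at the end of this answer; the my_gpu_pin structure, the helper function and the 64 KB page-size constant are illustrative assumptions, not part of NVIDIA's API, and most error handling is omitted:

#include <linux/kernel.h>
#include <linux/types.h>
#include "nv-p2p.h"   /* header shipped with the NVIDIA kernel driver */

/* GPUDirect RDMA pins GPU memory in 64 KB pages; addresses passed to
 * nvidia_p2p_get_pages() are expected to be aligned accordingly. */
#define GPU_PAGE_SIZE (64UL * 1024)

/* Hypothetical bookkeeping structure kept by the 3rd party driver. */
struct my_gpu_pin {
    struct nvidia_p2p_page_table *page_table;
};

/* Called back by the NVIDIA driver if the GPU mapping is torn down early. */
static void my_free_callback(void *data)
{
    struct my_gpu_pin *pin = (struct my_gpu_pin *)data;
    nvidia_p2p_free_page_table(pin->page_table);
    pin->page_table = NULL;
}

/* Pin the GPU buffer described by the tokens/address received from
 * user-space and log the physical addresses of its pages. */
static int my_pin_gpu_buffer(struct my_gpu_pin *pin, uint64_t p2p_token,
                             uint32_t va_space, uint64_t va, uint64_t len)
{
    uint64_t aligned_va = va & ~(GPU_PAGE_SIZE - 1);
    uint32_t i;
    int rc;

    rc = nvidia_p2p_get_pages(p2p_token, va_space, aligned_va, len,
                              &pin->page_table, my_free_callback, pin);
    if (rc != 0)
        return rc;

    /* These are the physical/bus addresses a 3rd party DMA engine
     * would be programmed with. */
    for (i = 0; i < pin->page_table->entries; i++)
        printk(KERN_INFO "GPU page %u @ 0x%llx\n", i,
               (unsigned long long)pin->page_table->pages[i]->physical_address);

    return 0;
}

When the driver is done with the buffer, it would release the mapping with nvidia_p2p_put_pages(p2p_token, va_space, va, pin->page_table).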

nVidia enforces the use of "physical addressing" to access each card via rDMA for GPUDirect. This greatly simplifies moving data from one computer to a remote system's PCI-Express bus by using that machine's physical addressing scheme, without having to worry about problems related to virtual addressing (e.g. resolving virtual addresses to physical ones). Each card has a physical address it resides at and can be accessed at that offset; only a small bit of logic must be added to the 3rd party driver attempting to perform rDMA operations. Also, these 32- or 64-bit Base Address Registers are part of the standard PCI configuration space, so the physical address of the card can easily be obtained by simply reading from its BARs, rather than having to obtain a mapped address that nVidia's driver obtained upon attaching to the card. nVidia's Unified Virtual Addressing (UVA) takes care of mapping the aforementioned physical addresses to a seemingly contiguous region of memory for user-space applications, like so:

(Figure: CUDA Virtual Address Space)

These regions of memory are further divided into three types: CPU, GPU, and FREE, which are all documented here.

Back to your usage case, though: since you're in user-space, you don't have direct access to the system's physical address space, and the addresses you're using are probably virtual addresses provided to you by nVidia's UVA. Assuming no previous allocations were made, your memory allocation should reside at offset +0x00000000, which would result in you seeing the same offset as the GPU itself. If you were to allocate a second buffer, I imagine you'd see this buffer start immediately after the end of the first buffer (at offset +0x00100000 from the base virtual address of the GPU, in your case of 1 MB allocations).
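If you want to check that on your machine, one way is to extend the program from the question to make a second 1 MB allocation and print both pairs of pointers. A small sketch (the exact placement is up to the allocator, so the second buffer isn't guaranteed to land immediately after the first):

#include <iostream>
#include "cuda_runtime.h"

int main() {
    // Allow pinned host allocations to be mapped into the device address space
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Two 1 MB pinned, mapped host buffers, as in the question
    unsigned char *h1 = NULL, *h2 = NULL;
    cudaHostAlloc(&h1, 1024*1024, cudaHostAllocMapped);
    cudaHostAlloc(&h2, 1024*1024, cudaHostAllocMapped);

    // Their UVA device pointers
    unsigned char *d1 = NULL, *d2 = NULL;
    cudaHostGetDevicePointer(&d1, h1, 0);
    cudaHostGetDevicePointer(&d2, h2, 0);

    std::cout << "buffer 1: host " << (size_t)h1 << "  device " << (size_t)d1 << std::endl;
    std::cout << "buffer 2: host " << (size_t)h2 << "  device " << (size_t)d2 << std::endl;
    std::cout << "buffer 2 starts " << (d2 - d1) << " bytes after buffer 1" << std::endl;

    cudaFreeHost(h2);
    cudaFreeHost(h1);
    return 0;
}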

If you were in kernel-space, however, and were writing a driver for your company's card to utilize rDMA for GPUDirect, you would use the 32- or 64-bit physical addresses assigned to the GPU by the system's BIOS and/or OS to rDMA data directly to and from the GPU, itself.
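As an aside, on Linux those BIOS/OS-assigned physical addresses are visible even from user-space: each PCI device's sysfs directory contains a resource file whose lines hold the start address, end address and flags of each region, the first six lines being BAR0 through BAR5. A small sketch, assuming the GPU sits at the hypothetical bus address 0000:01:00.0 (check yours with lspci):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Hypothetical bus:device.function of the GPU; find the real one with lspci
    const std::string bdf = "0000:01:00.0";

    // Each line of "resource" is "start end flags" in hex, one per PCI region;
    // the first six lines correspond to BAR0..BAR5.
    std::ifstream res("/sys/bus/pci/devices/" + bdf + "/resource");
    if (!res) {
        std::cerr << "could not open resource file for " << bdf << std::endl;
        return 1;
    }

    std::string line;
    for (int region = 0; std::getline(res, line); ++region)
        std::cout << "region " << region << ": " << line << std::endl;

    return 0;
}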

Additionally, it may be worth noting that not all DMA engines actually support virtual addresses for transfers -- in fact, most require physical addresses, as handling virtual addressing from a DMA engine can get complex (page 7), thus many DMA engines lack support for this.

To answer the question from your post's title, though: nVidia currently only supports physical addressing for rDMA+GPUDirect in kernel-space. For user-space applications, you will always be using the virtual address of the GPU given to you by nVidia's UVA, which is in the virtual address space of the CPU.


Relating to your application, here's a simplified breakdown of the process to follow for rDMA operations:

  1. Your user-space application creates buffers, which are in the scope of the Unified Virtual Addressing space nVidia provides (virtual addresses).
  2. Make a call to cuPointerGetAttribute(...) to obtain P2P tokens; these tokens pertain to memory inside the context of CUDA.
  3. Send all this information to kernel-space somehow (e.g. IOCTLs, read/writes to your driver, etc.); the sketch after this list shows steps 1-3 in code. At a minimum, you'll want these three things to end up in your kernel-space driver:
    • P2P token(s) returned by cuPointerGetAttribute(...)
    • UVA virtual address(es) of the buffer(s)
    • Size of the buffer(s)
  4. Now translate those virtual addresses to their corresponding physical addresses by calling nVidia's kernel-space functions, as these addresses are held in nVidia's page tables and can be accessed with the functions nVidia has exported, such as nvidia_p2p_get_pages(...), nvidia_p2p_put_pages(...), and nvidia_p2p_free_page_table(...) (see the kernel-side sketch earlier in this answer).
  5. Use these physical addresses acquired in the previous step to initialize your DMA engine that will be manipulating those buffers.
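As a minimal user-space sketch of steps 1-3 (the rdma_request structure is a hypothetical stand-in for whatever your kernel driver's IOCTL actually accepts, and the token query is only expected to succeed on GPUs that support rDMA for GPUDirect):

#include <cstring>
#include <iostream>
#include "cuda.h"            // driver API: cuPointerGetAttribute
#include "cuda_runtime.h"

// Hypothetical layout of what gets handed to the kernel driver (step 3);
// the real structure and IOCTL are whatever your driver defines.
struct rdma_request {
    unsigned long long p2p_token;
    unsigned int       va_space_token;
    unsigned long long gpu_va;    // UVA address of the GPU buffer
    size_t             size;
};

int main() {
    const size_t size = 1024 * 1024;

    // Step 1: a buffer in GPU memory, addressed through UVA.
    void *devmem = NULL;
    cudaMalloc(&devmem, size);
    CUdeviceptr gpu_ptr = reinterpret_cast<CUdeviceptr>(devmem);

    // Step 2: P2P tokens for that buffer (CUDA driver API).
    CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
    std::memset(&tokens, 0, sizeof(tokens));
    CUresult res = cuPointerGetAttribute(&tokens,
                                         CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                                         gpu_ptr);
    if (res != CUDA_SUCCESS) {
        std::cerr << "cuPointerGetAttribute failed: " << res << std::endl;
        return 1;
    }

    // Step 3: bundle everything the kernel driver needs.
    rdma_request req = { tokens.p2pToken, tokens.vaSpaceToken,
                         static_cast<unsigned long long>(gpu_ptr), size };
    std::cout << "p2pToken=" << req.p2p_token
              << " vaSpaceToken=" << req.va_space_token
              << " va=0x" << std::hex << req.gpu_va << std::endl;

    cudaFree(devmem);
    return 0;
}

In a real application the req structure would then be handed to your kernel driver (e.g. via an ioctl on its device node), which performs steps 4 and 5.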

A more in-depth explanation of this process can be found here.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow