Does nVidia's RDMA for GPUDirect always operate only on physical addresses (in the physical address space of the CPU)?

StackOverflow https://stackoverflow.com/questions/19841815

Question

As we know: http://en.wikipedia.org/wiki/IOMMU#Advantages

Peripheral memory paging can be supported by an IOMMU. A peripheral using the PCI-SIG PCIe Address Translation Services (ATS) Page Request Interface (PRI) extension can detect and signal the need for memory manager services.


But when we use an nVidia GPU with CUDA >= 5.0, we can use RDMA for GPUDirect, and we know that:

http://docs.nvidia.com/cuda/gpudirect-rdma/index.html#how-gpudirect-rdma-works

Traditionally, resources like BAR windows are mapped to user or kernel address space using the CPU's MMU as memory mapped I/O (MMIO) addresses. However, because current operating systems don't have sufficient mechanisms for exchanging MMIO regions between drivers, the NVIDIA kernel driver exports functions to perform the necessary address translations and mappings.

http://docs.nvidia.com/cuda/gpudirect-rdma/index.html#supported-systems

RDMA for GPUDirect currently relies upon all physical addresses being the same from the PCI devices' point of view. This makes it incompatible with IOMMUs and hence they must be disabled for RDMA for GPUDirect to work.

And if we allocate CPU RAM and map it into the UVA, as here:

#include <iostream>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

int main() {
    // Allow pinned host allocations to be mapped into the device address space
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate pinned, mapped host memory
    unsigned char *host_src_ptr = NULL;
    cudaHostAlloc(&host_src_ptr, 1024*1024, cudaHostAllocMapped);
    std::cout << "host_src_ptr = " << (size_t)host_src_ptr << std::endl;

    // Get the device (UVA) pointer for the same allocation
    unsigned char *uva_src_ptr = NULL;
    cudaHostGetDevicePointer(&uva_src_ptr, host_src_ptr, 0);
    std::cout << "uva_src_ptr  = " << (size_t)uva_src_ptr << std::endl;

    int b;  std::cin >> b;
    return 0;
}

We get equal pointers on Windows 7 x64, which means that cudaHostGetDevicePointer() does nothing:

host_src_ptr = 68719476736

uva_src_ptr = 68719476736

What does "sufficient mechanisms for exchanging MMIO regions between drivers" mean, which mechanism is meant here, and why can I not use the IOMMU, i.e. use a virtual address to access, via PCIe, the physical region of the BAR of another memory-mapped PCIe device?

And does this mean that RDMA for GPUDirect always operates only on physical addresses (in the physical address space of the CPU)? If so, why do we pass to the kernel function uva_src_ptr, which is equal to host_src_ptr, a plain pointer in the CPU's virtual address space?


Solution

The IOMMU is very useful in that it provides a set of mapping registers. It can arrange for any physical memory to appear within the address range accessible by a device, and it can make physically scattered buffers look contiguous to devices, too. This is not good for 3rd party PCI/PCI-Express cards or remote machines attempting to access the raw physical offset of an nVidia GPU, as it may result in not actually accessing the intended regions of memory, or in the IOMMU inhibiting/restricting such accesses on a per-card basis. The IOMMU must therefore be disabled, because

"RDMA for GPUDirect currently relies upon all physical addresses being the same from the PCI devices' point of view."

-nVidia, Design Considerations for rDMA and GPUDirect

When drivers attempt to utilize the CPU's MMU and map regions of memory-mapped I/O (MMIO) for use within kernel-space, they typically keep the returned address from the memory mapping to themselves. Because each driver operates within its own context or namespace, exchanging these mappings between nVidia's driver(s) and other 3rd party vendors' drivers that wish to support rDMA+GPUDirect would be very difficult and would result in a vendor-specific solution (possibly even product-specific, if drivers vary greatly between products from the 3rd party). Also, today's operating systems don't have any good solution for exchanging MMIO mappings between drivers, so nVidia exports several functions that allow 3rd party drivers to easily access this information from within kernel-space itself.
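To make "exports several functions" concrete, here is a rough kernel-side sketch (in the C a Linux driver would use) of the kind of interface meant. The nvidia_p2p_* calls come from NVIDIA's nv-p2p.h and are the ones named again in the step-by-step list at the end of this answer; the my_gpu_pin structure, the helper function and the 64 KB page-size constant are illustrative assumptions, not part of NVIDIA's API, and most error handling is omitted:

#include <linux/kernel.h>
#include <linux/types.h>
#include "nv-p2p.h"   /* header shipped with the NVIDIA kernel driver */

/* GPUDirect RDMA pins GPU memory in 64 KB pages; addresses passed to
 * nvidia_p2p_get_pages() are expected to be aligned accordingly. */
#define GPU_PAGE_SIZE (64UL * 1024)

/* Hypothetical bookkeeping structure kept by the 3rd party driver. */
struct my_gpu_pin {
    struct nvidia_p2p_page_table *page_table;
};

/* Called back by the NVIDIA driver if the GPU mapping is torn down early. */
static void my_free_callback(void *data)
{
    struct my_gpu_pin *pin = (struct my_gpu_pin *)data;
    nvidia_p2p_free_page_table(pin->page_table);
    pin->page_table = NULL;
}

/* Pin the GPU buffer described by the tokens/address received from
 * user-space and log the physical addresses of its pages. */
static int my_pin_gpu_buffer(struct my_gpu_pin *pin, uint64_t p2p_token,
                             uint32_t va_space, uint64_t va, uint64_t len)
{
    uint64_t aligned_va = va & ~(GPU_PAGE_SIZE - 1);
    uint32_t i;
    int rc;

    rc = nvidia_p2p_get_pages(p2p_token, va_space, aligned_va, len,
                              &pin->page_table, my_free_callback, pin);
    if (rc != 0)
        return rc;

    /* These are the physical/bus addresses a 3rd party DMA engine
     * would be programmed with. */
    for (i = 0; i < pin->page_table->entries; i++)
        printk(KERN_INFO "GPU page %u @ 0x%llx\n", i,
               (unsigned long long)pin->page_table->pages[i]->physical_address);

    return 0;
}

When the driver is done with the buffer, it would release the mapping with nvidia_p2p_put_pages(p2p_token, va_space, va, pin->page_table).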

nVidia enforces the use of "physical addressing" to access each card via rDMA for GPUDirect. This greatly simplifies moving data from one computer to a remote system's PCI-Express bus by using that machine's physical addressing scheme, without having to worry about problems related to virtual addressing (e.g. resolving virtual addresses to physical ones). Each card has a physical address it resides at and can be accessed at that offset; only a small bit of logic must be added to the 3rd party driver attempting to perform rDMA operations. Also, these 32- or 64-bit Base Address Registers are part of the standard PCI configuration space, so the physical address of the card can easily be obtained by simply reading from its BARs, rather than having to obtain a mapped address that nVidia's driver obtained upon attaching to the card. nVidia's Unified Virtual Addressing (UVA) takes care of mapping the aforementioned physical addresses to a seemingly contiguous region of memory for user-space applications, like so:

(Figure: CUDA Virtual Address Space)

These regions of memory are further divided into three types: CPU, GPU, and FREE, which are all documented here.

Back to your usage case, though: since you're in user-space, you don't have direct access to the system's physical address space, and the addresses you're using are probably virtual addresses provided to you by nVidia's UVA. Assuming no previous allocations were made, your memory allocation should reside at offset +0x00000000, which would result in you seeing the same offset as the GPU itself. If you were to allocate a second buffer, I imagine you'd see this buffer start immediately after the end of the first buffer (at offset +0x00100000 from the base virtual address of the GPU, in your case of 1 MB allocations).
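If you want to check that on your machine, one way is to extend the program from the question to make a second 1 MB allocation and print both pairs of pointers. A small sketch (the exact placement is up to the allocator, so the second buffer isn't guaranteed to land immediately after the first):

#include <iostream>
#include "cuda_runtime.h"

int main() {
    // Allow pinned host allocations to be mapped into the device address space
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Two 1 MB pinned, mapped host buffers, as in the question
    unsigned char *h1 = NULL, *h2 = NULL;
    cudaHostAlloc(&h1, 1024*1024, cudaHostAllocMapped);
    cudaHostAlloc(&h2, 1024*1024, cudaHostAllocMapped);

    // Their UVA device pointers
    unsigned char *d1 = NULL, *d2 = NULL;
    cudaHostGetDevicePointer(&d1, h1, 0);
    cudaHostGetDevicePointer(&d2, h2, 0);

    std::cout << "buffer 1: host " << (size_t)h1 << "  device " << (size_t)d1 << std::endl;
    std::cout << "buffer 2: host " << (size_t)h2 << "  device " << (size_t)d2 << std::endl;
    std::cout << "buffer 2 starts " << (d2 - d1) << " bytes after buffer 1" << std::endl;

    cudaFreeHost(h2);
    cudaFreeHost(h1);
    return 0;
}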

If you were in kernel-space, however, and were writing a driver for your company's card to utilize rDMA for GPUDirect, you would use the 32- or 64-bit physical addresses assigned to the GPU by the system's BIOS and/or OS to rDMA data directly to and from the GPU, itself.
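As an aside, on Linux those BIOS/OS-assigned physical addresses are visible even from user-space: each PCI device's sysfs directory contains a resource file whose lines hold the start address, end address and flags of each region, the first six lines being BAR0 through BAR5. A small sketch, assuming the GPU sits at the hypothetical bus address 0000:01:00.0 (check yours with lspci):

#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Hypothetical bus:device.function of the GPU; find the real one with lspci
    const std::string bdf = "0000:01:00.0";

    // Each line of "resource" is "start end flags" in hex, one per PCI region;
    // the first six lines correspond to BAR0..BAR5.
    std::ifstream res("/sys/bus/pci/devices/" + bdf + "/resource");
    if (!res) {
        std::cerr << "could not open resource file for " << bdf << std::endl;
        return 1;
    }

    std::string line;
    for (int region = 0; std::getline(res, line); ++region)
        std::cout << "region " << region << ": " << line << std::endl;

    return 0;
}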

Additionally, it may be worth noting that not all DMA engines actually support virtual addresses for transfers -- in fact, most require physical addresses, as handling virtual addressing from a DMA engine can get complex (page 7), thus many DMA engines lack support for this.

To answer the question from your post's title, though: nVidia currently only supports physical addressing for rDMA+GPUDirect in kernel-space. For user-space applications, you will always be using the virtual address of the GPU given to you by nVidia's UVA, which is in the virtual address space of the CPU.


Relating to your application, here's a simplified breakdown of the process to follow for rDMA operations:

  1. Your user-space application creates buffers, which are in the scope of the Unified Virtual Addressing space nVidia provides (virtual addresses).
  2. Make a call to cuPointerGetAttribute(...) to obtain P2P tokens; these tokens pertain to memory inside the context of CUDA.
  3. Send all this information to kernel-space somehow (e.g. IOCTLs, read/writes to your driver, etc.); the sketch after this list shows steps 1-3 in code. At a minimum, you'll want these three things to end up in your kernel-space driver:
    • P2P token(s) returned by cuPointerGetAttribute(...)
    • UVA virtual address(es) of the buffer(s)
    • Size of the buffer(s)
  4. Now translate those virtual addresses to their corresponding physical addresses by calling nVidia's kernel-space functions, as these addresses are held in nVidia's page tables and can be accessed with the functions nVidia has exported, such as nvidia_p2p_get_pages(...), nvidia_p2p_put_pages(...), and nvidia_p2p_free_page_table(...) (see the kernel-side sketch earlier in this answer).
  5. Use these physical addresses acquired in the previous step to initialize your DMA engine that will be manipulating those buffers.
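As a minimal user-space sketch of steps 1-3 (the rdma_request structure is a hypothetical stand-in for whatever your kernel driver's IOCTL actually accepts, and the token query is only expected to succeed on GPUs that support rDMA for GPUDirect):

#include <cstring>
#include <iostream>
#include "cuda.h"            // driver API: cuPointerGetAttribute
#include "cuda_runtime.h"

// Hypothetical layout of what gets handed to the kernel driver (step 3);
// the real structure and IOCTL are whatever your driver defines.
struct rdma_request {
    unsigned long long p2p_token;
    unsigned int       va_space_token;
    unsigned long long gpu_va;    // UVA address of the GPU buffer
    size_t             size;
};

int main() {
    const size_t size = 1024 * 1024;

    // Step 1: a buffer in GPU memory, addressed through UVA.
    void *devmem = NULL;
    cudaMalloc(&devmem, size);
    CUdeviceptr gpu_ptr = reinterpret_cast<CUdeviceptr>(devmem);

    // Step 2: P2P tokens for that buffer (CUDA driver API).
    CUDA_POINTER_ATTRIBUTE_P2P_TOKENS tokens;
    std::memset(&tokens, 0, sizeof(tokens));
    CUresult res = cuPointerGetAttribute(&tokens,
                                         CU_POINTER_ATTRIBUTE_P2P_TOKENS,
                                         gpu_ptr);
    if (res != CUDA_SUCCESS) {
        std::cerr << "cuPointerGetAttribute failed: " << res << std::endl;
        return 1;
    }

    // Step 3: bundle everything the kernel driver needs.
    rdma_request req = { tokens.p2pToken, tokens.vaSpaceToken,
                         static_cast<unsigned long long>(gpu_ptr), size };
    std::cout << "p2pToken=" << req.p2p_token
              << " vaSpaceToken=" << req.va_space_token
              << " va=0x" << std::hex << req.gpu_va << std::endl;

    cudaFree(devmem);
    return 0;
}

In a real application the req structure would then be handed to your kernel driver (e.g. via an ioctl on its device node), which performs steps 4 and 5.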

A more in-depth explanation of this process can be found here.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow