Question

Does anyone have experience analyzing the performance of CUDA applications that use the zero-copy memory model (see: Default Pinned Memory Vs Zero-Copy Memory)?
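
By zero-copy I mean the mapped pinned memory pattern, roughly like this (a minimal sketch; the kernel and sizes are placeholders, not my actual code):

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                       // every load/store crosses PCIe
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h_buf, *d_buf;

        cudaSetDeviceFlags(cudaDeviceMapHost);     // must be set before any CUDA context exists
        cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);  // device-visible alias of h_buf

        scale<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();
        cudaFreeHost(h_buf);
        return 0;
    }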

I have a kernel that uses the zero-copy feature, and in NVVP I see the following:

Running the kernel on an average problem size, I get an instruction replay overhead of 0.7%, so nothing major, and all of it is global memory replay overhead.

When I really jack up the problem size, I get an instruction replay overhead of 95.7%, all of which is due to global memory replay overhead.

However, the global load efficiency and global store efficiency are the same for the normal problem size run and the very large problem size run. I'm not really sure what to make of this combination of metrics.

The main thing I'm not sure of is which statistics in NVVP will help me see what is going on with the zero-copy feature. Any ideas of which statistics I should be looking at?

Solution

Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:

  1. The memory operation used a size specifier (vector type) that requires multiple transactions to perform the address divergence calculation and to move data to/from the L1 cache.
  2. The memory operation had thread address divergence requiring access to multiple cache lines (see the sketch after this list).
  3. The memory transaction missed the L1 cache. When the missed data is returned to L1, the L1 notifies the warp scheduler to replay the instruction.
  4. The load/store unit (LSU) resources were full and the instruction needs to be replayed once resources are available.
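
Reason 2 above is the classic coalescing problem. A minimal sketch (hypothetical kernels, not from the question) contrasting a fully coalesced load with a strided one that can touch up to 32 cache lines per warp:

    __global__ void coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];                 // a warp reads 1 contiguous cache line
    }

    __global__ void strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[(i * stride) % n];  // a warp can touch up to 32 cache lines
    }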

The latency to

  • L2 is 200-400 cycles
  • device memory (DRAM) is 400-800 cycles
  • zero-copy memory over PCIe is 1000s of cycles

The replay overhead increases with problem size because the longer latencies cause more L1 misses and more contention for LSU resources.
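
To make the latency gap visible outside the profiler, the same kernel can be timed against device memory and against mapped zero-copy memory; a rough sketch (an assumed harness, not part of the original answer):

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void touch(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1.0f;
    }

    static float timeKernel(float *buf, int n)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        touch<<<(n + 255) / 256, 256>>>(buf, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main(void)
    {
        const int n = 1 << 24;
        cudaSetDeviceFlags(cudaDeviceMapHost);     // must precede context creation

        float *d_buf;                              // ordinary device memory
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        float *h_buf, *zc_buf;                     // mapped (zero-copy) memory
        cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&zc_buf, h_buf, 0);

        timeKernel(d_buf, n);                      // warm-up launch
        printf("device memory: %.3f ms\n", timeKernel(d_buf, n));
        printf("zero-copy:     %.3f ms\n", timeKernel(zc_buf, n));

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }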

The global load efficiency is not increasing because it is the ratio of the ideal amount of data that would need to be transferred for the memory instructions that were executed to the actual amount of data transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache-line boundary (a 32-bit operation is 1 cache line, a 64-bit operation is 2 cache lines, a 128-bit operation is 4 cache lines). Accessing zero-copy memory is slower and less efficient, but it does not increase or change the amount of data transferred.
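
As a worked check of that ratio (assuming Fermi's 128-byte L1 cache lines; this example is illustrative, not from the original answer): a warp of 32 threads loading consecutive 4-byte floats from an aligned address requests 128 bytes and moves exactly one 128-byte line, so load efficiency is 128/128 = 100%. The same warp loading at a 128-byte stride still requests 128 bytes but moves 32 full lines (4096 bytes), so efficiency drops to 128/4096 ≈ 3%. Neither ratio changes if the backing store is zero-copy memory instead of device memory.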

The profiler exposes the following counters:

  • gld_throughput
  • l1_cache_global_hit_rate
  • dram_{read, write}_throughput
  • l2_l1_read_hit_rate

In the zero-copy case, all of these metrics should be much lower.
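
If the command line is preferred over NVVP, the same counters can be collected with nvprof (metric names match the list above but vary by GPU generation, so this assumes a Fermi/Kepler-era toolkit; ./your_app is a placeholder):

    nvprof --metrics gld_throughput,l1_cache_global_hit_rate,dram_read_throughput,dram_write_throughput,l2_l1_read_hit_rate ./your_app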

The memory experiments in the Nsight VSE CUDA profiler will show the amount of data transferred over PCIe (zero-copy memory).

Licensed under: CC-BY-SA with attribution