Question

Does anyone have experience analyzing the performance of CUDA applications that use the zero-copy memory model (see: Default Pinned Memory Vs Zero-Copy Memory)?
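
By zero-copy I mean the mapped pinned memory pattern, roughly like this (a minimal sketch; the kernel and sizes are placeholders, not my actual code):

    #include <cuda_runtime.h>

    __global__ void scale(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;                       // every load/store crosses PCIe
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *h_buf, *d_buf;

        cudaSetDeviceFlags(cudaDeviceMapHost);     // must be set before any CUDA context exists
        cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);  // device-visible alias of h_buf

        scale<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();
        cudaFreeHost(h_buf);
        return 0;
    }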

I have a kernel that uses the zero-copy feature, and in NVVP I see the following:

Running the kernel on an average problem size, I get an instruction replay overhead of 0.7%, so nothing major, and all of it is global memory replay overhead.

When I really jack up the problem size, I get an instruction replay overhead of 95.7%, all of which is due to global memory replay overhead.

However, the global load efficiency and global store efficiency are the same for the normal problem size run and the very large problem size run. I'm not really sure what to make of this combination of metrics.

The main thing I'm not sure of is which statistics in NVVP will help me see what is going on with the zero-copy feature. Any ideas of which statistics I should be looking at?

Solution

Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:

  1. The memory operation used a size specifier (vector type) that requires multiple transactions to perform the address divergence calculation and to move data to/from the L1 cache.
  2. The memory operation had thread address divergence requiring access to multiple cache lines (see the sketch after this list).
  3. The memory transaction missed the L1 cache. When the missed data is returned to L1, the L1 notifies the warp scheduler to replay the instruction.
  4. The load/store unit (LSU) resources were full and the instruction needs to be replayed once resources are available.
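
Reason 2 above is the classic coalescing problem. A minimal sketch (hypothetical kernels, not from the question) contrasting a fully coalesced load with a strided one that can touch up to 32 cache lines per warp:

    __global__ void coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];                 // a warp reads 1 contiguous cache line
    }

    __global__ void strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[(i * stride) % n];  // a warp can touch up to 32 cache lines
    }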

The latency to

  • L2 is 200-400 cycles
  • device memory (DRAM) is 400-800 cycles
  • zero-copy memory over PCIe is 1000s of cycles

The replay overhead increases with problem size because the longer latencies cause more L1 misses and more contention for LSU resources.
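
To make the latency gap visible outside the profiler, the same kernel can be timed against device memory and against mapped zero-copy memory; a rough sketch (an assumed harness, not part of the original answer):

    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void touch(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] += 1.0f;
    }

    static float timeKernel(float *buf, int n)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        touch<<<(n + 255) / 256, 256>>>(buf, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main(void)
    {
        const int n = 1 << 24;
        cudaSetDeviceFlags(cudaDeviceMapHost);     // must precede context creation

        float *d_buf;                              // ordinary device memory
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        float *h_buf, *zc_buf;                     // mapped (zero-copy) memory
        cudaHostAlloc((void **)&h_buf, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&zc_buf, h_buf, 0);

        timeKernel(d_buf, n);                      // warm-up launch
        printf("device memory: %.3f ms\n", timeKernel(d_buf, n));
        printf("zero-copy:     %.3f ms\n", timeKernel(zc_buf, n));

        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }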

The global load efficiency is not increasing because it is the ratio of the ideal amount of data that would need to be transferred for the memory instructions that were executed to the actual amount of data transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache-line boundary (a 32-bit operation is 1 cache line, a 64-bit operation is 2 cache lines, a 128-bit operation is 4 cache lines). Accessing zero-copy memory is slower and less efficient, but it does not increase or change the amount of data transferred.
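
As a worked check of that ratio (assuming Fermi's 128-byte L1 cache lines; this example is illustrative, not from the original answer): a warp of 32 threads loading consecutive 4-byte floats from an aligned address requests 128 bytes and moves exactly one 128-byte line, so load efficiency is 128/128 = 100%. The same warp loading at a 128-byte stride still requests 128 bytes but moves 32 full lines (4096 bytes), so efficiency drops to 128/4096 ≈ 3%. Neither ratio changes if the backing store is zero-copy memory instead of device memory.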

The profiler exposes the following counters:

  • gld_throughput
  • l1_cache_global_hit_rate
  • dram_{read, write}_throughput
  • l2_l1_read_hit_rate

In the zero-copy case, all of these metrics should be much lower.
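
If the command line is preferred over NVVP, the same counters can be collected with nvprof (metric names match the list above but vary by GPU generation, so this assumes a Fermi/Kepler-era toolkit; ./your_app is a placeholder):

    nvprof --metrics gld_throughput,l1_cache_global_hit_rate,dram_read_throughput,dram_write_throughput,l2_l1_read_hit_rate ./your_app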

The memory experiments in the Nsight VSE CUDA profiler will show the amount of data transferred over PCIe (zero-copy memory).

Licensed under: CC-BY-SA with attribution