How many memory latency cycles per memory access type in OpenCL/CUDA?

https://stackoverflow.com/questions/4097635

28-09-2019
|

Question

I looked through the programming guide and best practices guide and it mentioned that Global Memory access takes 400-600 cycles. I did not see much on the other memory types like texture cache, constant cache, shared memory. Registers have 0 memory latency.

I think constant cache is the same as registers if all threads use the same address in constant cache. Worst case I am not so sure.

Shared memory is the same as registers so long as there are no bank conflicts? If there are then how does the latency unfold?

What about texture cache?

Solution

The latency to the shared/constant/texture memories is small and depends on which device you have. In general though GPUs are designed as a throughput architecture which means that by creating enough threads the latency to the memories, including the global memory, is hidden.

The reason the guides talk about the latency to global memory is that the latency is orders of magnitude higher than that of other memories, meaning that it is the dominant latency to be considered for optimization.

You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. group of 32 threads) access the same address then there is no penalty, i.e. the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must serialize since the cache can only provide one value at a time. If you're using the CUDA Profiler, then this will show up under the serialization counter.

Shared memory, unlike constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.

OTHER TIPS

For (Kepler) Tesla K20 the latencies are as follows:

Global memory: 440 clocks
Constant memory
    L1: 48 clocks
    L2: 120 clocks
Shared memory: 48 clocks
Texture memory
    L1: 108 clocks
    L2: 240 clocks

How do I know? I ran the microbenchmarks described by the authors of Demystifying GPU Microarchitecture through Microbenchmarking. They provide similar results for the older GTX 280.

This was measured on a Linux cluster, the computing node where I was running the benchmarks was not used by any other users or ran any other processes. It is BULLX linux with a pair of 8 core Xeons and 64 GB RAM, nvcc 6.5.12. I changed the sm_20 to sm_35 for compiling.

There is also an operands cost chapter in PTX ISA although it is not very helpful, it just reiterates what you already expect, without giving precise figures.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow