Question

On NVIDIA's 2.x architecture, each warp has 64kb of memory that is by default partitioned into 48kb of Shared Memory and 16kb of L1 cache (servicing global and constant memory).

We all know about the bank conflicts of accessing Shared Memory - the memory is divided into 32 banks of size 32-bits to allow simultaneous independent access by all 32 threads. On the other hand, Global Memory, though much slower, does not experience bank conflicts because memory requests are coalesced across the warp.

Question: Suppose some data from global or constant memory is cached in the L1 cache for a given warp. Is access to this data subject to bank conflicts, like Shared Memory (since the L1 Cache and the Shared Memory are in fact the same hardware), or is it bank-conflict-free in the way that Global/Constant memory is?

Solution

On NVIDIA's 2.x architecture, each warp has 64kb of memory that is by default partitioned into 48kb of Shared Memory and 16kb of L1 cache

Compute capability 2.x devices have 64 KB of SRAM per Streaming Multiprocessor (SM) that can be configured, per device or per kernel (see the sketch below the list), as

  • 16 KB L1 and 48 KB shared memory, or
  • 48 KB L1 and 16 KB shared memory.
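
The split is requested from host code. Below is a minimal sketch using the CUDA runtime API; the kernel name myKernel and its arguments are hypothetical, and the runtime treats the setting as a preference rather than a guarantee:

    #include <cuda_runtime.h>

    // Hypothetical kernel, used only to show the cache-preference calls.
    __global__ void myKernel(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;
    }

    int main()
    {
        // Prefer 48 KB L1 / 16 KB shared memory for the whole device...
        cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

        // ...or override the preference for a single kernel.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        // Launch as usual; the runtime applies the preference when possible.
        // myKernel<<<blocks, threads>>>(d_out, d_in);
        return 0;
    }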

(servicing global and constant memory).

Loads and stores to global memory, local memory, and surface memory go through the L1. Accesses to constant memory go through dedicated constant caches.
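
As a small illustration of the two paths, a hedged sketch; the names coeffs and scale are invented for the example:

    #include <cuda_runtime.h>

    // Reads of this array are serviced by the dedicated constant caches.
    __constant__ float coeffs[32];

    // Reads through the pointer "in" are ordinary global loads and, on
    // compute capability 2.x, are cached in L1.
    __global__ void scale(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            // The constant index is uniform across the warp, so the
            // constant cache can broadcast it in a single access.
            out[i] = in[i] * coeffs[blockIdx.x % 32];
    }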

We all know about the bank conflicts of accessing Shared Memory - the memory is divided into 32 banks of size 32-bits to allow simultaneous independent access by all 32 threads. On the other hand, Global Memory, though much slower, does not experience bank conflicts because memory requests are coalesced across the warp.

Accesses through L1 to global or local memory are done per cache line (128 B). When a load request is issued to L1, the LSU needs to perform an address divergence calculation to determine which threads are accessing the same cache line. The LSU then performs an L1 cache tag lookup. If the line is cached, it is written back to the register file; otherwise, the request is sent to L2. If the warp has threads not serviced by the request, a replay is requested and the operation is reissued for the remaining threads.

Multiple threads in a warp can access the same bytes in the cache line without causing a conflict.
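
To contrast the two cost models, an illustrative sketch (the kernel and array names are invented): a column-wise read of a 32x32 shared-memory tile makes every thread of a warp hit the same bank, a 32-way bank conflict, whereas the equivalent strided read from global memory is simply a load that touches 32 distinct 128 B cache lines and gets replayed until all of them are serviced:

    #include <cuda_runtime.h>

    #define TILE 32

    // Launch with blockDim = (32, 32).
    __global__ void bank_conflict_demo(const float *g_in, float *g_out)
    {
        __shared__ float tile[TILE][TILE];

        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Row-wise shared access: consecutive threads hit consecutive banks,
        // so the warp is serviced in one pass (no conflict).
        tile[ty][tx] = g_in[ty * TILE + tx];
        __syncthreads();

        // Column-wise shared access: stride of 32 words, so all 32 threads
        // of the warp hit the same bank -> 32-way bank conflict.
        float v = tile[tx][ty];

        // The same stride in global memory is not a bank-conflict problem;
        // the warp touches 32 distinct 128 B cache lines, so the load is
        // replayed until every thread has been serviced.
        float w = g_in[tx * TILE + ty];

        g_out[ty * TILE + tx] = v + w;
    }

The usual remedy for the shared-memory case is to pad the tile to TILE+1 columns so that a column no longer maps to a single bank; no such trick applies (or is needed) on the L1 path.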

Question: Suppose some data from global or constant memory is cached in the L1 cache for a given warp.

Constant memory is not cached in L1; it is cached in the constant caches.

Is access to this data subject to bank conflicts, like Shared Memory (since the L1 Cache and the Shared Memory are in fact the same hardware), or is it bank-conflict-free in the way that Global/Constant memory is?

L1 and the constant cache access a single cache line at a time, so there are no bank conflicts.
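
To make that concrete, a minimal sketch (names are illustrative): every thread of the warp reads the same word, once from global memory through L1 and once from constant memory; each is a single-cache-line access, and neither is slowed by anything resembling a bank conflict:

    #include <cuda_runtime.h>

    __constant__ float c_scale;   // serviced by the constant caches

    __global__ void broadcast_demo(const float *g_table, float *out)
    {
        // All 32 threads of the warp read the same global word: one 128 B
        // line through L1, serviced in a single pass.
        float g = g_table[0];

        // All 32 threads read the same constant word: the constant cache
        // broadcasts it to the whole warp in a single access.
        float c = c_scale;

        out[blockIdx.x * blockDim.x + threadIdx.x] = g * c;
    }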

Licensed under: CC-BY-SA with attribution