It seems based on the comments that the only working solution is to turn off L1 caching. This can be accomplished on a program-wide basis by passing the following switch to nvcc when compiling:
-Xptxas -dlcm=cg
The L1 caches are a property/resource of the SM, not the device as a whole. Since threadblocks execute on specific SMs, the activity of one threadblock in its L1 cache can be incoherent from the activity of another threadblock and its L1 cache (assuming it happens to be running on a different SM), even though they are both referencing the same locations in global memory. L1 caches in different SMs have no connection with each other and are not guaranteed to be coherent with each other.
Note that the L2 cache is device-wide and therefore "coherent" from the perspective of individual threadblocks. Turning off L1 caching has no effect on L2 caching, so there is still the possibility of some caching benefit. However, a request satisfied out of L2 takes longer than one satisfied out of L1, so turning off L1 caching program-wide is a pretty large hammer to try to get things working.
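If the program-wide switch is too blunt, individual loads can also be made to bypass L1 using the PTX .cg (cache-global) operator via inline assembly; some toolkits also document a __ldcg() intrinsic for the same purpose. A minimal sketch, where the helper name loadBypassL1 is mine and not from the code in question:

```
// Hedged sketch: force a single load to bypass L1 (cached in L2 only),
// as a narrower alternative to compiling the whole program with -dlcm=cg.
__device__ __forceinline__ int loadBypassL1(const int *addr)
{
    int value;
    // ld.global.cg loads with the .cg policy: the line is cached in L2 but not L1.
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(value) : "l"(addr));
    return value;
}
```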
The volatile keyword in front of a variable definition should have the effect of telling the compiler to skip L1 caching on loads (according to my understanding). But volatile by itself doesn't address the write path, so it's possible for one threadblock in one SM to do a volatile read, pulling a value out of L2, modify that value, and then write it back, where it ends up in L1 (until it is evicted). If another threadblock reads the same global value, it may not see the effect of the update.
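As a minimal sketch of the read path only (the flag name and kernel are illustrative, not from the code under discussion):

```
__device__ int flag = 0;

__global__ void consumer(int *result)
{
    // volatile tells the compiler not to keep the value in a register and
    // (per the discussion above) to skip L1 on each load, so the loop sees
    // updates that have reached L2.
    volatile int *vflag = &flag;
    while (*vflag == 0) { /* spin until another block raises the flag */ }

    // Note: this only covers the read side. If the writer's store is still
    // sitting in its own SM's L1, this loop may never observe it.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *result = *vflag;
}
```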
Diligent use of __threadfence(), while tedious, should force any such updates out of L1 into L2, so that other threadblocks can read them. However, this still leaves a synchronization gap from when the value is written to when it is observable by other SMs/threadblocks.
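A minimal sketch of that write-path pattern (the names data and ready are illustrative):

```
__device__ int data;
__device__ int ready = 0;

__global__ void producer(int value)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        data = value;
        // __threadfence() forces the write to 'data' out of this SM's L1 so it
        // is visible device-wide (in L2) before the flag below is raised.
        __threadfence();
        // Using an atomic for the flag write keeps it out of L1 as well
        // (per the "atomics + volatile" suggestion later in this answer).
        atomicExch(&ready, 1);
    }
}
```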
(Global) Atomics should also have the effect of going directly to "global memory" to read and write the values used.
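For example, a hedged sketch of inter-block arrival counting with an atomic (the names are illustrative):

```
__device__ unsigned int counter = 0;

__global__ void arrive(unsigned int *lastBlock, unsigned int numBlocks)
{
    if (threadIdx.x == 0) {
        // atomicAdd performs the read and the write directly on the global
        // (L2-backed) copy, so the returned old value is coherent across SMs.
        unsigned int old = atomicAdd(&counter, 1u);
        if (old == numBlocks - 1)
            *lastBlock = blockIdx.x;  // this block was the last to arrive
    }
}
```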
It may be instructive to also go through the code to ensure that every possible read from a globally synchronized location is handled properly (e.g. with volatile or using atomics), that every possible write to a globally synchronized location is handled properly (e.g. with __threadfence() or atomics), and also to check for race conditions between different blocks.
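As an illustration of the kind of inter-block race such an audit should catch (the names are illustrative):

```
__device__ unsigned int head = 0;

__global__ void claim_slot_buggy(int *slots)
{
    if (threadIdx.x == 0) {
        // Inter-block race: two blocks can read the same 'head', both write
        // slot n, and both set it to n + 1 -- a lost update.
        unsigned int n = head;
        slots[n] = (int)blockIdx.x;
        head = n + 1;
    }
}

__global__ void claim_slot_fixed(int *slots)
{
    if (threadIdx.x == 0) {
        // The atomic makes the read-modify-write of 'head' indivisible,
        // so every block claims a distinct slot.
        unsigned int n = atomicAdd(&head, 1u);
        slots[n] = (int)blockIdx.x;
    }
}
```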
As discovered, the process of creating a stable globally-synchronized environment within the GPU is non-trivial. These other questions may also be of interest (e.g. with respect to Kepler, and e.g. discussing global semaphores).
Edit: To respond to a question posted in the comments, I would say this:
Perhaps there's no issue. However, __threadfence() provides no guarantee (that I know of) of a maximum completion time. Therefore, at the moment an update is made to a global location, only the L1 associated with the executing threadblock/SM gets updated. Then we hit the __threadfence(). Presumably the threadfence takes some time to complete, and during this time another threadblock could be resident on the same SM, brought in for execution (while the previous thread/warp/block is stalled at the threadfence), and "see" the updated global value in the (local) L1 associated with that SM. Other threadblocks executing in other SMs will see the "stale" value until the __threadfence() completes. This is what I am referring to as a possible "synchronization gap": two different blocks can still see two different values for a brief period of time. Whether this matters or not will depend on how the global value is being used for synchronization between blocks (since that is the topic under discussion). Therefore atomics + volatile may be a better choice than volatile + threadfence, to try and cover both read and write paths for synchronization.
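A minimal sketch of that atomics + volatile combination (the flag name go is illustrative):

```
__device__ int go = 0;

__global__ void signal_and_wait(void)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        atomicExch(&go, 1);        // write path: the atomic goes straight to the L2-backed copy

    if (threadIdx.x == 0) {
        volatile int *vgo = &go;
        while (*vgo == 0) { }      // read path: volatile load, not served from a register
    }
    __syncthreads();               // the rest of each block proceeds once its thread 0 sees the flag
}
```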
Edit #2: It seems from the comments that the combination of atomics plus volatile also solved the problem.