How are global atomic operations implemented in Kepler? I get lower performance writing gmem directly than using atomics

StackOverflow https://stackoverflow.com/questions/17404717

02-06-2022
Question

I would like to know how global atomics are implemented in Kepler.

Consider this piece of code:

1. if (threadIdx.x < workers) {
2.    temp = atomicAdd(dst, temp + rangeOffset);
3.    if (isLastPartialCalc(temp)) {                            
4.        atomicAdd(dst,-300000.0f);
5.    }
6. }

If I change line 4 to this:

*dst -= 300000.0f;

The performance is lower! The change is safe, since no other threads will write to this value (the outputs are the same).

kernel using atomics: ~883 µs
kernel using gmem directly: ~903 µs

I have run it several times and I always see this ~20 µs penalty for the change.

UPDATE: It seems the store without the atomic always produces a miss in L2, whereas the atomic version always produces a hit... so I guess writing to a location that was flagged (or something) as "atomic" is not allowed in L2, and it triggers another request to gmem.


Solution 2

The direct (non-atomic) store misses in L2 and forces a cache-line update; that update is apparently more costly (in your particular code) than simply issuing the second global atomic access.

A single global atomic access from a single SM to global memory on Kepler GK110 (e.g. K20) is actually quite fast.

As indicated in the Kepler white paper, Kepler has improved speed of global atomics as compared to Fermi:

"Atomic operation throughput to a common global memory address is improved by 9x to one operation per clock."
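As a rough illustration of the common-address pattern the white paper describes, here is a hypothetical microbenchmark sketch (kernel and variable names are my own) in which every thread atomically increments a single global counter:

```cuda
#include <cstdio>

// Hypothetical sketch: all threads hammer one global address with atomicAdd,
// the exact pattern whose throughput Kepler improved ~9x over Fermi.
__global__ void commonAddressAtomics(unsigned int *counter, int itersPerThread)
{
    for (int i = 0; i < itersPerThread; ++i)
        atomicAdd(counter, 1u);   // every thread targets the same address
}

int main()
{
    unsigned int *counter;
    cudaMalloc(&counter, sizeof(unsigned int));
    cudaMemset(counter, 0, sizeof(unsigned int));

    commonAddressAtomics<<<64, 256>>>(counter, 100);
    cudaDeviceSynchronize();

    unsigned int result;
    cudaMemcpy(&result, counter, sizeof(result), cudaMemcpyDeviceToHost);
    printf("counter = %u\n", result);   // 64 blocks * 256 threads * 100 iters
    cudaFree(counter);
    return 0;
}
```

Timing this kernel on Fermi vs. Kepler hardware (e.g. with `nvprof` or CUDA events) would expose the throughput difference the white paper claims.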

Other tips

Atomics have "fire-and-forget" semantics. This means that the kernel issues the atomic operation and lets the actual atomic operation be executed by the cache (not on the SM), and the kernel will move on to the next instruction without waiting for the actual atomic operation to complete. This only works if there is no return value from the atomic operation, which is the case in this example. The fire-and-forget semantics let the SM get on with its computations, offloading the computation of the atomic to the cache.
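The distinction can be sketched as follows (kernel names are illustrative, not from the question):

```cuda
// Sketch: the first kernel discards atomicAdd's return value, so the SM can
// fire the operation off to the L2 cache and continue; the second kernel uses
// the returned old value, so the thread must wait for the atomic to complete.
__global__ void fireAndForget(float *acc, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(acc, in[i]);                 // return value unused: no stall
    // ...independent instructions can proceed immediately
}

__global__ void mustWait(float *acc, const float *in, float *oldVals)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float prev = atomicAdd(acc, in[i]);    // return value used: thread waits
    oldVals[i] = prev;
}
```

Note that in the question's code, line 2 uses the return value (so it waits), while line 4 discards it (so it can fire and forget).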

This is great if no other thread is going to read that location right away. And it opens up the possibility for a thread to handle multiple data locations quickly: if it has several atomic operations in sequence, the thread can just fire them off. Put them in adjacent memory and the memory traffic will likely be reduced by coalescing.
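A minimal sketch of that pattern, assuming a hypothetical kernel in which each thread fires several independent atomics at adjacent addresses:

```cuda
// Sketch: each thread issues four atomics in sequence, discarding the results,
// so it does not wait on any of them; the targets sit in adjacent memory, which
// lets the hardware coalesce the resulting traffic.
__global__ void adjacentAtomics(float *bins, const float *in)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    atomicAdd(&bins[0], in[i + 0]);   // fired off back to back;
    atomicAdd(&bins[1], in[i + 1]);   // no dependency between them,
    atomicAdd(&bins[2], in[i + 2]);   // so the SM never stalls waiting
    atomicAdd(&bins[3], in[i + 3]);   // for any individual atomic
}
```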

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow