How are global atomic operations implemented in Kepler? I get lower performance writing gmem directly than using atomics

StackOverflow https://stackoverflow.com/questions/17404717

02-06-2022
Question

I would like to know how global atomics are implemented in Kepler.

Consider this piece of code:

1. if (threadIdx.x < workers) {
2.    temp = atomicAdd(dst, temp + rangeOffset);
3.    if (isLastPartialCalc(temp)) {                            
4.        atomicAdd(dst,-300000.0f);
5.    }
6. }

If I change line 4 to this:

*dst -= 300000.0f;

The performance is lower! The change is safe, since no other threads will write to this value (the outputs are the same).

kernel using atomics: ~883 µs
kernel using gmem directly: ~903 µs

I have run it several times and I always see this ~20 µs penalty for the change.

UPDATE: It seems the store without the atomic always produces a miss in L2, whereas the atomic version always produces a hit... so I guess writing to a location that was flagged (or something) as "atomic" is not allowed in L2, and it triggers another request to gmem.


Solution 2

The direct (non-atomic) store misses in L2 and forces a cache-line update; that update is apparently more costly (in your particular code) than simply issuing the second global atomic access.

A single global atomic access from a single SM to global memory on Kepler GK110 (e.g. K20) is actually quite fast.

As indicated in the Kepler white paper, Kepler has improved speed of global atomics as compared to Fermi:

"Atomic operation throughput to a common global memory address is improved by 9x to one operation per clock."
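As a rough illustration of the common-address pattern the white paper describes, here is a hypothetical microbenchmark sketch (kernel and variable names are my own) in which every thread atomically increments a single global counter:

```cuda
#include <cstdio>

// Hypothetical sketch: all threads hammer one global address with atomicAdd,
// the exact pattern whose throughput Kepler improved ~9x over Fermi.
__global__ void commonAddressAtomics(unsigned int *counter, int itersPerThread)
{
    for (int i = 0; i < itersPerThread; ++i)
        atomicAdd(counter, 1u);   // every thread targets the same address
}

int main()
{
    unsigned int *counter;
    cudaMalloc(&counter, sizeof(unsigned int));
    cudaMemset(counter, 0, sizeof(unsigned int));

    commonAddressAtomics<<<64, 256>>>(counter, 100);
    cudaDeviceSynchronize();

    unsigned int result;
    cudaMemcpy(&result, counter, sizeof(result), cudaMemcpyDeviceToHost);
    printf("counter = %u\n", result);   // 64 blocks * 256 threads * 100 iters
    cudaFree(counter);
    return 0;
}
```

Timing this kernel on Fermi vs. Kepler hardware (e.g. with `nvprof` or CUDA events) would expose the throughput difference the white paper claims.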

Other tips

Atomics have "fire-and-forget" semantics. This means that the kernel issues the atomic operation and lets the actual atomic operation be executed by the cache (not on the SM), and the kernel will move on to the next instruction without waiting for the actual atomic operation to complete. This only works if there is no return value from the atomic operation, which is the case in this example. The fire-and-forget semantics let the SM get on with its computations, offloading the computation of the atomic to the cache.
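The distinction can be sketched as follows (kernel names are illustrative, not from the question):

```cuda
// Sketch: the first kernel discards atomicAdd's return value, so the SM can
// fire the operation off to the L2 cache and continue; the second kernel uses
// the returned old value, so the thread must wait for the atomic to complete.
__global__ void fireAndForget(float *acc, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(acc, in[i]);                 // return value unused: no stall
    // ...independent instructions can proceed immediately
}

__global__ void mustWait(float *acc, const float *in, float *oldVals)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float prev = atomicAdd(acc, in[i]);    // return value used: thread waits
    oldVals[i] = prev;
}
```

Note that in the question's code, line 2 uses the return value (so it waits), while line 4 discards it (so it can fire and forget).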

This is great if no other thread is going to read that location right away. And it opens up the possibility for a thread to handle multiple data locations quickly: if it has several atomic operations in sequence, the thread can just fire them off. Put them in adjacent memory and the memory traffic will likely be reduced by coalescing.
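A minimal sketch of that pattern, assuming a hypothetical kernel in which each thread fires several independent atomics at adjacent addresses:

```cuda
// Sketch: each thread issues four atomics in sequence, discarding the results,
// so it does not wait on any of them; the targets sit in adjacent memory, which
// lets the hardware coalesce the resulting traffic.
__global__ void adjacentAtomics(float *bins, const float *in)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    atomicAdd(&bins[0], in[i + 0]);   // fired off back to back;
    atomicAdd(&bins[1], in[i + 1]);   // no dependency between them,
    atomicAdd(&bins[2], in[i + 2]);   // so the SM never stalls waiting
    atomicAdd(&bins[3], in[i + 3]);   // for any individual atomic
}
```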

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow