In your particular code, this cache line update is apparently more costly than simply performing the second global atomic access.
A single global atomic access from a single SM to global memory on Kepler GK110 (e.g. K20) is actually quite fast.
As the Kepler white paper indicates, Kepler improved the speed of global atomics compared to Fermi:
Atomic operation throughput to a common global memory address is improved by 9x to one operation per clock.
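
If you want to check this on your own hardware, here is a minimal sketch (not taken from your code) in which every thread does an `atomicAdd` on the same global address, timed with CUDA events. The kernel name `incrementShared` and the host helper `countWithGlobalAtomic` are names I made up for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread increments the same global counter, i.e. all atomics
// target one common global memory address (maximum contention).
__global__ void incrementShared(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical host helper (illustration only): launches the kernel,
// optionally reports the elapsed time in milliseconds, and returns the
// final counter value, which should equal blocks * threadsPerBlock.
unsigned int countWithGlobalAtomic(int blocks, int threadsPerBlock, float *elapsedMs)
{
    unsigned int *dCounter = nullptr;
    unsigned int hCounter = 0;
    cudaMalloc(&dCounter, sizeof(unsigned int));
    cudaMemset(dCounter, 0, sizeof(unsigned int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time only the kernel, not the allocation or the copy back.
    cudaEventRecord(start);
    incrementShared<<<blocks, threadsPerBlock>>>(dCounter);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    if (elapsedMs != nullptr) {
        cudaEventElapsedTime(elapsedMs, start, stop);
    }
    cudaMemcpy(&hCounter, dCounter, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dCounter);
    return hCounter;
}
```

Calling `countWithGlobalAtomic(64, 256, &ms)` from `main` should return 64 * 256 = 16384. The elapsed time will vary by GPU and also includes launch overhead, so treat it as a rough comparison point against your cache-line-update variant rather than a direct measurement of atomic throughput.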