Removing the volatile keyword from that code could break that code on Fermi and Kepler GPUS. Those GPUs lack instructions to directly operate on shared memory. Instead, the compiler must emit a load/store pair to and from register.
What the volatile keyword does in this context is make the compiler honour that load-operate-store cycle and not perform an optimisation that would keep the value of s_data[tid]
in register. To keep the sum accumulating in register would break the implicit memory syncronisation required to make that warp level shared memory summation work correctly.