The error here was to profile the code in debug mode (-G
compiler flag: "Generate debug information for device code"). The behavior of the program is deeply changed, and this should not be used to profile and optimize one's code.
One other thing: a thorough documentation of nvcc's debug mode is hard to find. nvcc probably dumps the registers/shared memory in global memory for easier host access and debugging, which may in turn hide problems such as race conditions in shared memory (cf. discussion here: https://stackoverflow.com/a/10726970/1043187). Thus, programs such as cuda-memcheck --tool racecheck
should be used in release mode too.