Some possible reasons:
- shared memory bank conflicts (which you don't have)
- constant memory conflicts (i.e. different threads in a warp requesting different locations in constant memory from the same instruction)
- warp-divergent code (an if/then/else that takes different paths for different threads in a warp)
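
To make the last two points concrete, here is a rough host-side sketch in plain C++ (not CUDA, and not a profiler). It assumes a 32-thread warp and uses a simplified model: a constant memory request costs one pass per distinct address in the warp, and a branch costs one pass per distinct path taken. The names `Warp`, `make_warp`, and `distinct_values` are made up for illustration, and real hardware behavior depends on the architecture.

```cpp
#include <array>
#include <cstddef>
#include <set>

// Hypothetical model: one int per thread (lane) of a 32-thread warp.
constexpr std::size_t kWarpSize = 32;
using Warp = std::array<int, kWarpSize>;

// Build a warp by evaluating per_lane(lane) for lane = 0..31.
// per_lane stands in for "the constant memory index this thread reads"
// or "the branch this thread takes".
template <typename F>
Warp make_warp(F per_lane) {
    Warp w{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        w[lane] = per_lane(static_cast<int>(lane));
    }
    return w;
}

// Count distinct values across the warp. In this simplified model that is
// the number of serialized passes: distinct constant memory addresses for
// one load instruction, or distinct paths through an if/else.
int distinct_values(const Warp& per_thread) {
    std::set<int> seen(per_thread.begin(), per_thread.end());
    return static_cast<int>(seen.size());
}
```

In this model, `distinct_values(make_warp([](int) { return 0; }))` is 1 (every thread reads the same constant, so the value is broadcast), `distinct_values(make_warp([](int lane) { return lane; }))` is 32 (every thread reads a different location, the worst case), and `distinct_values(make_warp([](int lane) { return lane % 2; }))` is 2 (an odd/even branch, so the warp runs both sides one after the other).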
This presentation may be of interest, especially slides 8-11.