Is that the code you used? All it does is read data from a 16 MB buffer. I ran it on my PC, where 16 MB comes from RAM, calculating MB/second: 993 at stride 2, dropping to 880 at stride 999. Based on measuring the running time in microseconds, your time calculation produced 0.0040 at stride 2, increasing to 0.0045 at stride 999.
There are all sorts of reasons for speed reductions at increased strides, such as burst reading, cache-line alignment, and different memory banks.