In your code, the loop walks through the addresses of one row, one by one, and then moves on to the next row. If you swap the positions of i and j, it instead walks through the addresses of one column, one by one, and then moves on to the next column.
In C, multi-dimensional arrays are laid out in a linear address space in row-major order: element by element within a row, then row by row. So dst[i][j] = src[i][j] in your case means *(dst + 4096 * i + j) = *(src + 4096 * i + j), treating dst and src as pointers to the first element:
*(dst + 4096 * 0 + 0) = *(src + 4096 * 0 + 0);
*(dst + 4096 * 0 + 1) = *(src + 4096 * 0 + 1);
*(dst + 4096 * 0 + 2) = *(src + 4096 * 0 + 2);
//...
while with i and j reversed it means:
*(dst + 4096 * 0 + 0) = *(src + 4096 * 0 + 0);
*(dst + 4096 * 1 + 0) = *(src + 4096 * 1 + 0);
*(dst + 4096 * 2 + 0) = *(src + 4096 * 2 + 0);
//...
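So the extra 1 second in the second case is caused by accessing memory in a non-contiguous manner: each access lands 4096 elements away from the previous one, so the CPU cache is used far less effectively.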
You don't need to do the timing yourself, because you can run your program with the "time" command on Linux/UNIX:
$ time ./loop
The results on my Linux box for the two cases:
$ time ./loop_i_j
real 0m0.244s
user 0m0.062s
sys 0m0.180s
$ time ./loop_j_i
real 0m1.072s
user 0m0.995s
sys 0m0.073s