That is due to spatial locality. When your program needs some data from memory, the processor reads not just that specific data but a whole cache line's worth of neighboring data along with it. So in the next iteration, when you need the adjacent data, it is already there in your cache.
In the other case, your program can't take advantage of spatial locality since you are not reading the neighboring data in consecutive iterations.
Say your data is laid out in the memory like:
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
When your program needs to read, say, the data labeled 0, it reads the entire row:

0 1 2 3 4 5 6 7 8 9

so that when you need the data labeled 1, it is already in the cache and your program runs faster.
By contrast, if you are reading the data column-wise, this doesn't help you: each access lands in a different row, so you get a cache miss every time and the processor has to do another memory read.
In short, memory reads are costly; fetching neighboring data along with the requested data is the processor's way of optimizing reads to save time.