With current Intel processors, there is no performance difference between loading two different cache lines that are both in L1 cache, all else being equal. Given `float a[16], b[16];` with `a[0]` recently loaded, `a[1]` in the same cache line as `a[0]`, and `b[0]` not recently loaded but still in L1 cache, then there will be no performance difference between loading `a[1]` and `b[0]` in the absence of some other factor.
One thing that can cause a difference is if there has recently been a store to some address that shares some bits with one of the values being loaded, although the entire address is different. Intel processors compare some of the bits of addresses to determine whether they might match a store that is currently in progress. If the bits match, some Intel processors delay the load instruction to give the processor time to resolve the complete virtual address and compare it to the address being stored. However, this is an incidental effect that is not particular to `a[1]` or `b[0]`.
It is also theoretically possible that a compiler that sees your code is loading both `a[0]` and `a[1]` in short succession might make some optimization, such as loading them both with one instruction. My comments above apply to hardware behavior, not C implementation behavior.
With the two-dimensional array scenario, there should still be no difference as long as the entire array `M` is in L1 cache. However, column traversals of arrays are notorious for performance problems when the array exceeds L1 cache. A problem occurs because addresses are mapped to sets in cache by fixed bits in the address, and each cache set can hold only a limited number of cache lines, such as four. Here is a problem scenario:
- An array `M` has a row length that is a multiple of the distance that results in addresses being mapped to the same cache sets, such as 4096 bytes. E.g., in the array `float M[1024][1024];`, `M[0][0]` and `M[1][0]` are 4096 bytes apart and map to the same cache set.
- As you traverse a column of the array, you access `M[0][0]`, `M[1][0]`, `M[2][0]`, `M[3][0]`, and so on. The cache line for each of these elements is loaded into cache.
- As you continue along the column, you access `M[8][0]`, `M[9][0]`, and so on. Since each of these uses the same cache set as the previous ones and the cache set can hold only four lines, the earlier lines containing `M[0][0]` and so on are evicted from cache.
- When you complete the column and start the next column by reading `M[0][1]`, the data is no longer in L1 cache, and all of your loads must fetch the data from L2 cache (or worse if you also thrashed L2 cache in the same way).