I do not see justification for maximizing consecutive use of one cache line. The cache does not operate “one line at a time”, and there is generally no advantage to using one cache line repeatedly versus using any of the lines that are in cache.
A better goal is to maximize the number of accesses that are served from a line in L1 cache, instead of needing to be fetched from slower cache or memory. As long as an access “hits” a line currently in cache, we do not care which cache line it is.
The i5-2500K is a Sandy Bridge processor. The Sandy Bridge L1 data cache is 32 KiB and is eight-way associative, with 64-byte cache lines. This means the 32,768-byte cache has 512 lines, organized into 64 sets of eight lines each. Each memory address maps to exactly one set, as shown below. Within each set, eight recently used cache lines are retained. (The replacement policy is not strictly least-recently-used, but it approximates it and may produce similar results.)
Cache lookups work in this way:
- Given a byte address x, let t = floor(x/64) (due to cache line size).
- Let s = t % 64 (to select the set).
- Check set s to see whether it contains the byte at address x.
Consider the effect of row length on these cache lookups. With a row length of 65,536 bytes, the addresses of array elements a[i][j] and a[i+1][j] differ by 65,536 bytes. That means their values for t in the above lookup procedure differ by exactly 1024, and their values for s are identical. Therefore, they map to the same set.
Once the algorithm moves up or down by more than eight rows, without changing columns outside of a cache line, the single cache set being used cannot handle the nine recently used cache lines. One of them must be evicted. In effect, the cache size is eight lines (512 bytes) instead of 512 lines (32,768 bytes).
A simple way to address this is to pad the array so that the rows are 65,536+p bytes long, for some padding amount p. The array would be allocated with extra space and defined with longer rows than normal. The extra columns may generally be ignored. There is no need to initialize them; we do not care about their contents, just the effect they have on the addresses. (Alternately, they could be used for supplementary data if that is convenient for the program.)
With this padding, the distance between a[i][j] and a[i+1][j] is 65,536+p bytes, so the difference in t values is 1024+p/64, and the difference in s values is p/64 % 64. E.g., if p is 64 or 320, the difference in s values is 1 or 5, respectively.
I suggest testing p = 9·64 = 576. Any value of 64 or greater will ensure that array elements in the same column in consecutive rows map to different cache sets. However, the algorithm described in the question wanders in columns as well as in rows, so if p were small, our fix to make consecutive rows map to different cache sets might be negated by column-wandering that meanders back to the same cache set. Other values of p should be tried as well.
This is not intended to be a complete solution to the problem, as there are many factors that affect performance.