Optimising 2D array indexing for cache line

Question 1

I do not see justification for maximizing consecutive use of one cache line. The cache does not operate “one line at a time”, and there is generally no advantage to using one cache line repeatedly versus using any of the lines that are in cache.

A better goal is to maximize the number of accesses that are served from a line in L1 cache, instead of needing to be fetched from slower cache or memory. As long as an access “hits” a line currently in cache, we do not care which cache line it is.

The i5-2500K is a Sandy Bridge processor. The Sandy Bridge L1 data cache is 32 KiB and is eight-way associative, with 64-byte cache lines. This means the 32,768-byte cache has 512 lines, and they are organized into 64 sets of eight lines each. Each memory address maps to exactly one set, as shown below. Within each set, eight cache lines are retained, from lines that have been recently used in that set. (The replacement algorithm is not least-recently-used, but it is an attempt to be useful and may have similar results to least-recently-used.)

Cache lookups work in this way:

Given a byte address x, let t = floor(x/64) (due to cache line size).
Let s = t % 64 (to select the set).
Check set s to see whether it contains the byte at address x.

Consider the effect of row length on these cache lookups. With a row length of 65,536 bytes, the addresses of array elements a[i][j] and a[i+1][j] differ by 65,536 bytes. That means their values for t in the above lookup procedure differ by exactly 1024, and their values for s are identical. Therefore, they map to the same set.

Once the algorithm moves up or down by more than eight rows, without changing columns outside of a cache line, the single cache set being used cannot handle the nine recently used cache lines. One of them must be evicted. In effect, the cache size is eight lines (512 bytes) instead of 512 lines (32,768 bytes).

A simple way to address this is to pad the array so that the rows are 65,536+p bytes long, for some padding amount p. The array would be allocated with extra space and defined with longer rows than normal. The extra columns may generally be ignored. There is no need to initialize them; we do not care about their contents, just the effect they have on the addresses. (Alternately, they could be used for supplementary data if that is convenient for the program.)

With this padding, the distance between a[i][j] and a[i+1][j] is 65,536+p bytes, so the difference in t values is 1024+p/64, and the difference in s values is p/64 % 64. E.g., if p is 64 or 320, the difference in s values is 1 or 5, respectively.

I suggest testing 9*64 for p. Any value 64 or greater will ensure that array elements in the same column in consecutive rows map to different cache sets. However, the algorithm described in the question wanders in columns as well as in rows. So, if p were small, our fix to make consecutive rows map to different cache sets might be negated by column-wandering that meanders back to the same cache set. Other values of p should be tried as well.

This is not intended to be a complete solution to the problem, as there are many factors that affect performance.

Question 2

This is probably not useful, but it may be interesting.

You could use Z-order addressing. It will map aligned 8x8 blocks to cache lines, so as long as you stay within one aligned 8x8 block you're always using the same cache line. But weird things sometimes happen when you cross from one block into the next.

Generating a Z-order address from an (x, y) pair is a little annoying:

static uint Interleave(uint x, uint y)
{
    y = (y | (y << 1)) & 0x00FF00FF;
    y = (y | (y << 2)) & 0x0F0F0F0F;
    y = (y | (y << 4)) & 0x33333333;
    y = (y | (y << 8)) & 0x55555555;

    x = (x | (x << 1)) & 0x00FF00FF;
    x = (x | (x << 2)) & 0x0F0F0F0F;
    x = (x | (x << 4)) & 0x33333333;
    x = (x | (x << 8)) & 0x55555555;

    return x | (y << 1);
}

(this is C#, should be easy to convert to C)

Less annoying if you have a CPU that supports PDEP, which is, so far, only Haswell.

But you probably don't need to do that often. You can directly increment or decrement the x or y part of a Z-order address (it's generalizable to adding any pair of constants (c1, c2) to a Z-address, if they are both nonzero it takes a little more code), like this: (those don't do any bounds checking)

static uint IncX(uint z)
{
    uint xsum = (z | 0xAAAAAAAA) + 1;
    return (xsum & 0x55555555) | (z & 0xAAAAAAAA);
}

static uint IncY(uint z)
{
    uint ysum = (z | 0x55555555) + 2;
    return (ysum & 0xAAAAAAAA) | (z & 0x55555555);
}

static uint DecX(uint z)
{
    uint xsum = (z & 0x55555555) - 1;
    return (xsum & 0x55555555) | (z & 0xAAAAAAAA);
}

static uint DecY(uint z)
{
    uint ysum = (z & 0xAAAAAAAA) - 2;
    return (ysum & 0xAAAAAAAA) | (z & 0x55555555);
}

You can even throw in some kinds of bounds-checking. I have routines for saturating increment/decrement, I'm only about 90% sure that they work though. Wrapping modulo a power of two is trivial, just do a binary AND on the result.

The addressing with the Z-coordinate is trivial, just add it to the base of the array. Moving is slightly more complicated than in (x, y) space, but if you combine this with the idea from your other post (looking up an area) you never even need to move (except in the computation of that lookup table, obviously). Packing a good surrounding area may be harder. But there is a less-good surrounding area that becomes trivial: offset the Z coordinate directly in Z-space in both directions and take everything in between (say, from Z-8 to Z+7). That will simulate fewer steps at once on average since it's not usually a square block and the current position is usually not in the middle, but the index into the lookup table would be easier to compute.

edit: it's probably better to take an aligned block instead of a range, because the ant can never walk from one of the "parts" of an unaligned range into an other "part" (the parts are, at best, diagonally connected, so it would have to step outside). That's also easy, just AND the least significant bits off of the Z-coordinate to get the start of the aligned block. The lookup table will need those least significant bits though so they have to become part of the index.

I don't expect this approach to win. But it's interesting, IMO.