Question

From memory, data may only be read in the natural word size of the architecture. For example, on a 32-bit system, data is read from memory in 4-byte chunks. If a 2-byte or 1-byte value is stored in memory, reading it will still require accessing a 4-byte word. (In the case of the 2-byte value, two 4-byte accesses might be required if the value straddles a word boundary.)

Therefore, access to an individual value is fastest when it requires accessing a single word, with minimal additional work (such as masking). If I'm correct, this is the reason virtual machines (such as the JVM or Android's Dalvik) lay out member variables on 4-byte boundaries in Object instances.

Another concept is cache friendliness, i.e. locality (e.g. L1, L2). If many values must be traversed/processed directly after each other, it is beneficial that they are stored close to each other (ideally, in a contiguous block). This is spatial locality. If this isn't possible, at least the operations on the same value should be done in the same time period (temporal locality -- i.e. there is a high chance that the value is retained in cache while the operations are performed on it).
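To illustrate spatial locality, here is a minimal sketch (the matrix size is an arbitrary assumption, chosen so the data exceeds typical L1/L2): summing a 2D array row by row touches memory contiguously, while summing it column by column jumps a full row's stride on every access. Both loops compute the same sum, but the row-major traversal typically runs much faster on large arrays because each fetched cache line is fully used.

```java
public class LocalityDemo {
    public static void main(String[] args) {
        final int n = 2048; // arbitrary; large enough that the matrix exceeds L1/L2
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                m[i][j] = i + j;

        // Row-major traversal: consecutive elements are adjacent in memory,
        // so each cache line fetched is fully consumed (spatial locality).
        long rowSum = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                rowSum += m[i][j];

        // Column-major traversal: each access lands in a different row array,
        // so cache lines are barely reused before being evicted.
        long colSum = 0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                colSum += m[i][j];

        System.out.println(rowSum == colSum); // same result, different memory behavior
    }
}
```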

As far as I can see, the above two concepts can be "contradictory" in some cases, and the choice between them depends on their usage scenario. For example, a smaller amount of contiguous data is more cache friendly than a greater amount (trivial), yet if random access is commonly required on some data, the word-aligned (but greater-sized) structure might be beneficial -- unless the whole structure fits in the cache. Therefore, whether locality (~arrays) or alignment benefits should be preferred depends on how the values will be manipulated, I think.

There is a scenario which is interesting for me: let's assume a pathfinding algorithm which receives the input graph (and other auxiliary structures) as arrays. (Most of its input arrays store values that are <= 32767.)

The pathfinding algorithm performs very many random accesses on the arrays (in several loops). In this sense, an int[] might be desired for input data (on Android/ARM), because the values will be on a word boundary when accessed. (On the other hand, if sequential traversals were needed, then a smaller datatype would be recommended -- especially for large arrays -- because of a higher probability of cache-friendliness.)

However, what if the (randomly accessed) input data would fit in L1/L2 if specified as a short[], but not if specified as an int[]? In such a case, would the advantage of the 4-byte alignment of int[] for random access be outweighed by the cache-friendliness of short[]?
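A back-of-envelope check of whether the data would fit can make this concrete. The cache sizes and element count below are assumptions for illustration (typical of ARM cores of that era; query your actual target CPU): since all values are <= 32767, a short[] halves the footprint of an int[], which can be exactly the difference between fitting in L2 or not.

```java
public class FootprintCheck {
    // Assumed cache sizes -- check your target hardware.
    static final int L1_BYTES = 32 * 1024;
    static final int L2_BYTES = 512 * 1024;

    static long footprintBytes(int elements, int bytesPerElement) {
        return (long) elements * bytesPerElement; // ignores array header/padding
    }

    public static void main(String[] args) {
        int elements = 200_000; // hypothetical size of the graph's input arrays

        long asInt = footprintBytes(elements, Integer.BYTES);  // 4 bytes each
        long asShort = footprintBytes(elements, Short.BYTES);  // 2 bytes each

        System.out.println("int[]:   " + asInt + " B, fits L2: " + (asInt <= L2_BYTES));
        System.out.println("short[]: " + asShort + " B, fits L2: " + (asShort <= L2_BYTES));
    }
}
```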

In a concrete application, of course, I'd make measurements for comparison. That wouldn't necessarily answer the above questions, however.


Solution

If you can ensure that moving to short leads to significantly better locality (i.e. everything stays in cache), this outweighs the alignment penalties.

Access to cache is in the low nanoseconds (< 10 ns); access to RAM is around 60-80 ns.
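Those latency figures lead to a simple expected-cost model. This is a hedged sketch, not a measurement: the 5 ns hit and 70 ns miss costs below are illustrative values taken from the ranges above, and real hit rates depend on the access pattern.

```java
public class AccessCostModel {
    // Illustrative latencies from the figures above (not measured).
    static final double CACHE_NS = 5.0;  // "low nanoseconds, < 10 ns"
    static final double RAM_NS = 70.0;   // midpoint of "60-80 ns"

    // Expected cost of one access given the cache hit rate.
    static double expectedNs(double hitRate) {
        return hitRate * CACHE_NS + (1.0 - hitRate) * RAM_NS;
    }

    public static void main(String[] args) {
        // An int[] that spills out of cache vs. a short[] that mostly fits:
        System.out.printf("hit rate 0.50: %.1f ns per access%n", expectedNs(0.50));
        System.out.printf("hit rate 0.95: %.1f ns per access%n", expectedNs(0.95));
    }
}
```

Even a modest improvement in hit rate dwarfs the few extra cycles a sub-word or unaligned access might cost.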

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow