Question

The Microsoft compiler seems to generate x64 code with functions (as opposed to data) aligned to 16 bytes, i.e. every function except the last in an object file has its code padded with 0xCC (int 3, the breakpoint interrupt instruction, presumably for easier debugging) up to the next 16-byte boundary.

Why is this? Does it actually improve performance? If so, how? Intuitively I would have expected if anything it should slightly reduce performance for cache reasons.

Solution

On most x86-64 microarchitectures, the code to execute is fetched from memory in aligned lines of 16 bytes (see the “Instruction fetch” sections). This means that an incoming branch starts with the largest number of prefetched and decoded instructions when its destination is a multiple of 16 bytes. When execution does not fall through from the preceding code to the label, as is the case at the beginning of a function, the padding instructions are never executed, so they cost nothing at run time, and aligning the label is often a gain. Optimizers usually do it unless they are told to optimize for code size (but it can still be a loss for the reason you state in your question: it reduces code density and makes the various caches less efficient).
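The fetch-line argument comes down to simple address arithmetic. Here is a minimal sketch, assuming the 16-byte fetch line described above; the function names are hypothetical and chosen only for illustration:

```python
FETCH_LINE = 16  # assumed fetch-line size in bytes, as described in the answer


def padding_to_align(addr: int, alignment: int = FETCH_LINE) -> int:
    """Padding bytes needed so that the next function starts on a boundary."""
    return -addr % alignment


def useful_bytes_in_first_fetch(entry: int, line: int = FETCH_LINE) -> int:
    """Bytes of the first fetched line that lie at or after the entry point."""
    return line - (entry % line)


if __name__ == "__main__":
    # If the next free address after a function is 0x1005, eleven padding
    # bytes are needed before the next function can start at 0x1010.
    print(padding_to_align(0x1005))
    # An aligned entry point can use the whole fetched line ...
    print(useful_bytes_in_first_fetch(0x1010))
    # ... while an entry at 0x100D only gets the last three bytes of its line.
    print(useful_bytes_in_first_fetch(0x100D))
```

The second function is a simplification: it only counts bytes, not instructions, and real front ends have more stages than a single fetch line.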

A case can also be made for aligning branch destinations that can be reached by fall-through (typically the beginning of a loop). Here the trade-off is even less likely to be favorable, because the nop instructions used as padding have to be executed on the fall-through path and would not have been there had the destination not been aligned. There are tricks to create long nop instructions that decode faster than several short ones, but this is still unhelpful on average, and optimizing compilers typically only do it when explicitly instructed to (for instance with GCC's -falign-loops option, as opposed to -falign-functions; scroll down to the discussion of the -falign-* options on this page).
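To see why long nops reduce the fall-through cost without eliminating it, the padding in front of an aligned loop can be counted in instructions rather than bytes. This is a rough sketch, assuming a single nop is at most 9 bytes long (the longest form in Intel's recommended multi-byte NOP table; some assemblers emit longer ones). The helper name is hypothetical:

```python
MAX_NOP_LEN = 9  # assumption: longest single nop the assembler will emit


def nop_lengths(padding: int, max_len: int = MAX_NOP_LEN) -> list:
    """Split `padding` bytes into as few nop instructions as possible."""
    lengths = []
    while padding > 0:
        n = min(padding, max_len)  # take the longest nop that still fits
        lengths.append(n)
        padding -= n
    return lengths


if __name__ == "__main__":
    # Worst case for 16-byte alignment: 15 bytes of padding before the loop.
    long_nops = nop_lengths(15)
    print(long_nops, len(long_nops))  # instructions executed with long nops
    print(len(nop_lengths(15, max_len=1)))  # instructions with 1-byte nops
```

Even in the best case the fall-through path still executes at least one extra instruction whenever padding is present, which is the cost the answer refers to.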

Licensed under: CC-BY-SA with attribution