Function code aligned to 16 bytes

Question

On most x86-64 architectures, the code to execute is fetched from memory in aligned lines of 16 bytes (see “Instruction fetch” sections). This means that an incoming branch starts with the largest number of prefetched and decoded instructions if its destination is a multiple of 16 bytes. When execution does not fall through from the preceding code to the label, as is the case at the beginning of a function, the padding instructions are never executed, so they cost nothing at run time, and aligning the label is often a gain. Optimizers usually do it unless they are told to optimize for code size. It can still be a loss for the reason you state in your question: it reduces code density and makes the various caches less efficient.
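To make this concrete, here is a minimal sketch in C, assuming GCC or Clang on x86-64. It uses the `aligned` function attribute (a compiler extension, not standard C) to request a 16-byte-aligned entry point for one function. The names `add_one` and `is_aligned_16` are made up for illustration.

```c
#include <stdint.h>

/* GCC/Clang extension: request at least 16-byte alignment for the
   entry point of this one function, whatever the optimization level. */
__attribute__((aligned(16)))
int add_one(int x)
{
    return x + 1;
}

/* Returns 1 if addr is a multiple of 16, 0 otherwise. */
int is_aligned_16(uintptr_t addr)
{
    return (addr & 15u) == 0;
}
```

Calling `is_aligned_16((uintptr_t)&add_one)` should report that the entry point sits on a 16-byte boundary. Converting a function pointer to `uintptr_t` is implementation-defined, but it behaves as expected with these compilers on x86-64.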

The case can also be made for aligning branch destinations that can be reached by fall-through (typically the beginning of a loop). Here the trade-off is even less likely to be favorable, because the nop instructions used as padding have to be executed during the fall-through, and they would not have been there if the destination had not been aligned. There are tricks to create long nop instructions that decode faster than several short ones, but this is still unhelpful on average, and optimizing compilers generally do it only when instructed to or when their target-specific defaults call for it. In GCC this is controlled by the -falign-loops option, as opposed to -falign-functions; see the discussion of the -falign-* options in GCC's documentation.
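The cost of the fall-through case is easy to quantify: it is the number of padding bytes between the end of the preceding code and the next 16-byte boundary. A minimal sketch in C, where `padding_to_align` is a hypothetical helper and not part of any toolchain:

```c
#include <stddef.h>
#include <stdint.h>

/* Number of padding bytes needed to move addr up to the next multiple
   of align. align must be a power of two. Returns 0 when addr is
   already aligned. */
size_t padding_to_align(uintptr_t addr, size_t align)
{
    size_t misalignment = (size_t)(addr & (align - 1));
    return (align - misalignment) & (align - 1);
}
```

For example, if a loop label would otherwise land at address 0x100d, `padding_to_align(0x100d, 16)` gives 3, so three bytes of nop have to be executed on every fall-through into the loop. In the worst case (an address ending in 1) the padding is 15 bytes.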