Question

For some time now, I have been searching and reading a lot about memory alignment, how it works and how to use it. The most relevant article I have found so far is this one.

But even with that I still have some questions about it:

  1. Outside of embedded systems, we often have huge amounts of memory in our computers, which makes memory management much less critical. I am completely into optimization, but is it really something that can make a difference if we compare the same program with and without its memory rearranged and aligned?
  2. Does memory alignment have other advantages? I read somewhere that the CPU works better/faster with aligned memory because it takes fewer instructions to process (does anyone have a link to an article/benchmark about this?). In that case, is the difference really significant? Are there more advantages than these two?
  3. In the linked article, in chapter 5, the author says:

    Beware: in C++, classes that look like structs may break this rule! (Whether they do or not depends on how base classes and virtual member functions are implemented, and varies by compiler.)

  4. The article talks mostly about structures, but are local variable declarations also affected by this?

    Do you have any idea how memory alignment works exactly in C++, since it seems to have some differences?

This former question contains the word "alignment", but it does not provide any answers to the questions above.


Solution

Yes, both the alignment and the arrangement of your data can make a big difference in performance, not just a few percent but from a few to many hundreds of percent.

Take this loop: two instructions matter if you run enough iterations.

.globl ASMDELAY
ASMDELAY:
    subs r0,r0,#1      @ decrement the loop counter and set the flags
    bne ASMDELAY       @ branch back until the counter reaches zero
    bx lr              @ return to caller

With and without cache, with varying alignment, and with branch prediction tossed in, you can vary the performance of those two instructions by a significant amount (timer ticks):

min      max      difference
00016DDE 003E025D 003C947F

This is a performance test you can very easily do yourself: add or remove nops around the code under test, do an accurate job of timing, move the instructions under test across a wide enough range of addresses to touch the edges of cache lines, and so on.

The same kind of thing happens with data accesses. Some architectures complain about unaligned accesses (performing a 32-bit read at address 0x1001, for example) by giving you a data fault. On some of those you can disable the fault and take the performance hit instead. Others that allow unaligned accesses simply give you the performance hit.

It is sometimes "instructions" but most of the time it is clock/bus cycles.

Look at the memcpy implementations in gcc for various targets. Say you are copying a structure that is 0x43 bytes: you may find an implementation that copies one byte, leaving 0x42, then copies 0x40 bytes in large efficient chunks, and then does the last 0x2 as two individual bytes or as a 16-bit transfer. Alignment and target come into play: if the source and destination addresses have the same alignment, say 0x1003 and 0x2003, you can do the one byte, then 0x40 in big chunks, then the 0x2; but if one is 0x1002 and the other 0x1003, it gets really ugly and really slow.

Most of the time it is bus cycles, or worse, the number of transfers. Take a processor with a 64-bit-wide data bus, like some ARMs, and do a four-word transfer (read or write, LDM or STM) at address 0x1004. That is a word-aligned address, and perfectly legal, but if the bus is 64 bits wide that single instruction will likely turn into three transfers: a 32-bit at 0x1004, a 64-bit at 0x1008, and a 32-bit at 0x1010. The same instruction at address 0x1008, though, could do a single four-word transfer starting at 0x1008. Each transfer has a setup time associated with it, so the 0x1004 vs 0x1008 address difference by itself can be several times faster, even (or especially) when using a cache and all accesses are cache hits.

Speaking of caches: even a two-word read at address 0x0FFC vs 0x1000 makes a difference. With cache misses, the read at 0x0FFC causes two cache line fills where 0x1000 causes one. You pay the penalty of a cache line fill anyway for a random access (reading more data than you use), but at 0x0FFC that penalty doubles. How your structures, and your data in general, are aligned, and how frequently you access that data, can also cause cache thrashing.

You can end up striping your data such that as you process it you create evictions; you could get really unlucky and end up using only a fraction of your cache, where as you jump through it the next blob of data collides with a prior blob. By mixing up your data or re-arranging functions in the source code, you can create or remove collisions. Since not all caches are created equal, the compiler is not going to help you here; it is on you. Even detecting the performance hit or improvement is on you.

All the things we have added to improve performance (wider data busses, pipelines, caches, branch prediction, multiple execution units/paths, etc.) will most often help, but they all have weak spots that can be exploited either intentionally or accidentally. There is very little the compiler or libraries can do about it. If you are interested in performance you need to tune, and one of the biggest tuning factors is the alignment of the code and the data: not just alignment on 32, 64, 128, or 256 bit boundaries, but also where things are relative to each other. You want heavily used loops or re-used data not to land in the same cache way; they each want their own. Compilers can help, for example with instruction ordering for a superscalar architecture: re-arranging instructions whose relative order does not matter can make a big performance gain or hit if you are not efficiently using the execution paths, but you have to tell the compiler what you are running on.

The biggest oversight is the assumption that the processor is the bottleneck. That has not been true for a decade or more: feeding the processor is the problem, and that is where issues like alignment performance hits and cache thrashing come into play. With a little work, even at the source code level, re-arranging data within a structure, ordering variable/struct declarations, ordering functions within the source file, and adding a little extra code to align data can improve performance several times over or more.

OTHER TIPS

Yes, memory alignment still matters.

Some processors actually can't perform reads at non-aligned addresses at all. If you're running on such hardware and you store your integers unaligned, you're likely to have to read each one with two instructions, followed by some more instructions to get the various bytes into the right places so you can actually use it. So aligned data is performance-critical.

The good news is that you mostly don't actually have to care. Almost any compiler for almost any language will produce machine code that respects the target system's alignment requirements. You only need to start thinking about it if you're taking direct control of the in-memory representation of your data, which is nowhere near as necessary as it once was. It's an interesting thing to know about, and absolutely critical if you want to understand the memory usage of the structures you're creating and how to reorganise them to be more efficient (avoiding padding). But unless you need that kind of control (and for most systems you just don't), you can happily go through an entire career not knowing or caring about it.

Yes, it still matters, and in some performance-critical algorithms you cannot rely on the compiler.

I am going to list only a few examples:

  1. From this answer:

Normally, the microcode will fetch the proper 4-byte quantity from memory, but if it's not aligned, it will have to fetch two 4-byte locations from memory and reconstruct the desired 4-byte quantity from the appropriate bytes of the two locations

  2. The SSE instruction set requires special alignment. If it is not met, you have to use special functions to load and store data at unaligned addresses. That means two extra instructions.

If you are not working on performance-critical algorithms, just forget about memory alignment. It is not really needed for normal programming.

We tend to avoid situations where it matters. If it matters, it matters. Unaligned data used to come up, for example, when processing binary data, which seems to be avoided nowadays (people use XML or JSON a lot).

If you somehow manage to create an unaligned array of integers, then on a typical Intel processor your code processing that array will run a bit slower than with aligned data. On an ARM processor it runs a bit slower if you tell the compiler the data is unaligned; if you use unaligned data without telling the compiler, it may either run an awful, awful lot slower or give wrong results, depending on the processor model and the operating system.

Explaining the reference to C++: in C, all the fields in a struct must be stored in ascending memory order. So if you have the fields char / double / char and want everything aligned, you will get one byte of char, seven bytes unused, eight bytes of double, one byte of char, seven bytes unused. In C++, structs follow the same rule for compatibility. But for classes, the compiler may reorder fields, so you might get one byte of char, another byte of char, six bytes unused, eight bytes of double: 16 bytes instead of 24. In C, developers usually avoid that situation by declaring the fields in a different order in the first place.

Many good points have already been mentioned in the answers above. Just to add: even in non-embedded systems that deal with data search/mining, memory performance and access times matter so much that, beyond alignment, assembly code is written for the purpose.

I also recommend a worthwhile read: http://dewaele.org/~robbe/thesis/writing/references/what-every-programmer-should-know-about-memory.2007.pdf

How important is memory alignment? Does it still matter?

Yes. No. It depends.

Outside of embedded systems, we often have huge amounts of memory in our computers, which makes memory management much less critical. I am completely into optimization, but is it really something that can make a difference if we compare the same program with and without its memory rearranged and aligned?

Your application will have a smaller memory footprint and work faster if it's properly aligned. In the typical desktop application, it won't matter outside of rare/atypical cases (like your application always ending up with the same performance bottleneck and requiring optimization). That is, the app will be smaller and faster if properly aligned, but for most practical cases it shouldn't affect the user one way or another.

Does memory alignment have other advantages? I read somewhere that the CPU works better/faster with aligned memory because it takes fewer instructions to process (does anyone have a link to an article/benchmark about this?). In that case, is the difference really significant? Are there more advantages than these two?

It can be. It is something to (possibly) keep in mind while writing code, but in most cases it should simply not matter. (That is, I still arrange my member variables by memory footprint and access frequency, which should ease caching, but I do so for ease of use, readability, and refactoring, not for caching purposes.)

Do you have any idea how memory alignment works exactly in C++, since it seems to have some differences?

I read about it when the alignof stuff came out (C++11?). I haven't bothered with it since (I am doing mostly desktop applications and backend server development these days).

Licensed under: CC-BY-SA with attribution