Aligned and unaligned memory accesses?

https://stackoverflow.com/questions/1063809

21-08-2019

Question

What is the difference between aligned and unaligned memory access?

I work on a TMS320C64x DSP, and I want to use the intrinsic functions (C functions that map to assembly instructions). It has:

ushort & _amem2(void *ptr);
ushort & _mem2(void *ptr);

where _amem2 does an aligned access of 2 bytes and _mem2 does unaligned access.

When should I use which?

Solution

An aligned memory access means that the pointer (as an integer) is a multiple of a type-specific value called the alignment. The alignment is the natural address multiple at which the type must be, or should be, stored (e.g. for performance reasons) on a CPU. For example, a CPU might require that all two-byte loads or stores are done through addresses that are multiples of two. For small primitive types (under 4 bytes), the alignment is almost always the size of the type. For structs, the alignment is usually the maximum alignment of any member.

The C compiler always puts variables that you declare at addresses which satisfy the "correct" alignment. So if ptr points to e.g. a uint16_t variable, it will be aligned and you can use _amem2. You need to use _mem2 only if you are accessing e.g. a packed byte array received via I/O, or bytes in the middle of a string.
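The alignment rules described above can be inspected from C itself. The following is a minimal sketch, assuming a C11 compiler; `struct sample` and `max_member_alignment` are made-up names for illustration, and the actual numbers depend on the target ABI.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignof (C11) */
#include <stddef.h>
#include <stdint.h>

/* Made-up struct for illustration only. */
struct sample {
    uint8_t  tag;
    uint32_t value;
    uint16_t count;
};

/* A struct has to be aligned at least as strictly as its most demanding member. */
static_assert(alignof(struct sample) >= alignof(uint32_t),
              "struct alignment covers its most demanding member");

/* Largest alignment requirement among the members of struct sample. */
size_t max_member_alignment(void)
{
    size_t a = alignof(uint8_t);
    if (alignof(uint32_t) > a) a = alignof(uint32_t);
    if (alignof(uint16_t) > a) a = alignof(uint16_t);
    return a;
}
```

On common targets one would expect `alignof(struct sample)` to equal `max_member_alignment()`, and `alignof(uint16_t)` to equal `sizeof(uint16_t)`, but both are properties of the platform rather than guarantees of the language.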

OTHER TIPS

Many computer architectures organize memory in "words" of several bytes each. For example, the Intel 32-bit architecture uses words of 32 bits, that is, 4 bytes each. Memory is still addressed at the single-byte level, however, so an address can be "aligned", meaning it starts at a word boundary, or "unaligned", meaning it doesn't.

On certain architectures, some memory operations may be slower, or not allowed at all, on unaligned addresses.

So, if you know your addresses are aligned, you can use _amem2() for speed. Otherwise, you should use _mem2().

Aligned addresses are those which are multiples of the access size in question.

An access of a 4-byte word at an address that is a multiple of 4 is aligned.
An access of 4 bytes at address 3 (say) is unaligned.
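The rule above can be written as a small check. This is a sketch, assuming the access size is a power of two (which holds for 1-, 2-, 4- and 8-byte accesses); `is_aligned` is a hypothetical helper name, not part of any TI API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when addr is a multiple of size.
   size must be a power of two (1, 2, 4, 8, ...), so that
   "multiple of size" is the same as "low bits are all zero". */
bool is_aligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}
```

For the two cases listed above, `is_aligned(8, 4)` is true and `is_aligned(3, 4)` is false.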

It is very likely that _mem2, which also has to work for unaligned accesses, carries extra work in its code to handle the alignment correctly and is therefore less optimal. This means that _mem2 is likely to be costlier than its _amem2 counterpart.

So, when you need performance (particularly when you know that the access latency is high), it would be prudent to identify when you can use the aligned access. _amem2 exists for this very purpose: to give you performance when you know the access is aligned.

When it comes to 2 byte accesses, identifying aligned operations is very simple.
If all the access addresses for the operation are 'even' (that is, their LSB is zero), you have 2-byte alignment. This can be easily checked with:

if (address & 1) {
    /* LSB is set: odd address, not aligned */
} else {
    /* LSB is clear: even address, aligned to 2 bytes */
}

I know this is an old question with a selected answer, but I didn't see anyone explain the actual difference between aligned and unaligned memory access...

This applies to DRAM, SRAM, flash or other memory. Take an SRAM as a simple example: it is built out of bits, and a specific SRAM is a fixed number of bits wide and a fixed number of rows deep. Let's say 32 bits wide and many rows deep.

If I do a 32-bit write to address 0x0000 in this SRAM, the memory controller around it can simply do a single write cycle to row 0.

If I do a 32-bit write to address 0x0001 in this SRAM, assuming that is allowed, the controller needs to read row 0, modify three of the bytes while preserving one, and write that back to row 0, then read row 1, modify one byte while leaving the other three as found, and write that back. Which bytes get modified depends on the endianness of the system.

The former is aligned and the latter unaligned. There is clearly a performance difference, plus the extra logic needed to do the four memory cycles and merge the byte lanes.

If I read 32 bits from address 0x0000, that is a single read of row 0, done. But to read from 0x0001 I have to do two reads, row 0 and row 1. Depending on the system design, the controller either just sends those 64 bits back to the processor, possibly taking two bus clocks instead of one, or has the extra logic to align the 32 bits on the data bus in one bus cycle.

16-bit reads are a little better. A read from 0x0000, 0x0001 or 0x0002 only needs a read of row 0. Depending on the system/processor design, the controller can send those 32 bits back for the processor to extract the value, or shift them in the memory controller so that they land on specific byte lanes and the processor doesn't have to rotate them around. One or the other has to do it, if not both. A read from 0x0003, though, is like the case above: you have to read row 0 and row 1, as one of your bytes is in each, and then either send 64 bits back for the processor to extract or have the memory controller combine the bits into one 32-bit bus response (assuming the bus between the processor and memory controller is 32 bits wide for these examples).

A 16-bit write, though, always ends up with at least one read-modify-write in this example SRAM. For addresses 0x0000, 0x0001 and 0x0002: read row 0, modify two bytes and write back. For address 0x0003: read two rows, modify one byte in each and write back.

For 8 bits, a read only needs the one row containing that byte; a write, though, is a read-modify-write of one row.
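The walk-through above can be turned into a small model that counts memory cycles. This is only a sketch of the example SRAM as described (32-bit rows, no per-byte write enables, so a partially written row needs a read-modify-write); `sram_cycles`, `read_cycles` and `write_cycles` are made-up names, and a real controller may behave differently.

```c
#include <assert.h>
#include <stdint.h>

#define ROW_BYTES 4u /* one SRAM row is 32 bits wide, as in the example */

/* Read and write cycles the controller issues to the SRAM. */
typedef struct {
    unsigned reads;
    unsigned writes;
} sram_cycles;

/* Cycles for reading len bytes at addr: one read per row touched. */
sram_cycles read_cycles(uint32_t addr, uint32_t len)
{
    assert(len > 0);
    uint32_t first_row = addr / ROW_BYTES;
    uint32_t last_row = (addr + len - 1u) / ROW_BYTES;
    sram_cycles c = { last_row - first_row + 1u, 0u };
    return c;
}

/* Cycles for writing len bytes at addr. A row that is fully overwritten
   takes a single write; a partially covered row needs read-modify-write. */
sram_cycles write_cycles(uint32_t addr, uint32_t len)
{
    assert(len > 0);
    sram_cycles c = { 0u, 0u };
    uint32_t end = addr + len; /* one past the last byte written */
    uint32_t first_row = addr / ROW_BYTES;
    uint32_t last_row = (end - 1u) / ROW_BYTES;
    for (uint32_t row = first_row; row <= last_row; row++) {
        uint32_t row_start = row * ROW_BYTES;
        uint32_t row_end = row_start + ROW_BYTES;
        if (addr > row_start || end < row_end)
            c.reads++; /* some bytes of this row must be preserved */
        c.writes++;
    }
    return c;
}
```

In this model, `write_cycles(0x0000, 4)` is a single write, while `write_cycles(0x0001, 4)` comes out as two reads and two writes, which matches the four memory cycles mentioned above.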

The ARMv4 didn't like unaligned accesses, although you could disable the trap, and the result was then not what you would expect from the description above (not important here). Current ARMs allow unaligned accesses and give you the behavior above; you can change a bit in a control register and then they will abort unaligned transfers. MIPS used to not allow it; I'm not sure what they do now. On x86, 68K (from the 68020 on) etc. it was allowed, and the memory controller may have had to do the most work.

The designs that don't permit it clearly do so for performance and less logic, at what some would say is a burden on programmers; others might say it is no extra work for the programmer, or even easier. Aligned or not, you can also see why it can be better not to try to save memory by making 8-bit variables, and instead go ahead and burn a 32-bit word, or whatever the natural size of a register or the bus is. It may help your performance at a small cost of some bytes. Not to mention the extra code the compiler would need to add to make, let's say, a 32-bit register mimic an 8-bit variable: masking and sometimes sign extension. With register-native sizes those additional instructions are not required. You can also pack multiple things into a bus/memory-wide location, do one memory cycle to collect or write them, and then use some extra instructions to manipulate them between registers; that costs no RAM and is a possible wash on the number of instructions.

I don't agree that the compiler will always align the data right for the target; there are ways to break that. And if the target doesn't support unaligned accesses, you will hit the fault. Programmers would never need to talk about this if the compiler always got it right for any legal code you could come up with; there would be no reason for this question unless it was about performance. If you don't control whether the void ptr address is aligned, then you have to use the _mem2() unaligned access all the time, or you have to do an if-then-else in your code based on the value of the ptr, as nik pointed out. With the parameter declared as void *, the C compiler has no way to deal with your alignment correctly, and it won't be guaranteed. If you take a char *ptr and feed it to these functions, all bets are off on the compiler getting it right without you adding extra code, either buried in the _mem2() function or outside these two functions. So, as written in your question, _mem2() is the only correct answer.
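The if-then-else on the pointer value could look roughly like the sketch below. It is portable C rather than DSP code: `load16_aligned` and `load16_unaligned` are hypothetical stand-ins for the roles that _amem2 and _mem2 play, and the check assumes that converting a pointer to `uintptr_t` exposes the low address bit, which holds on common flat-address targets but is implementation-defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for an aligned-only load (the role _amem2 plays on the DSP).
   The caller must pass an even address that really holds a uint16_t. */
static uint16_t load16_aligned(const void *ptr)
{
    assert(((uintptr_t)ptr & 1u) == 0);
    return *(const uint16_t *)ptr;
}

/* Stand-in for a load that works at any address (the role of _mem2).
   memcpy copies byte by byte, so no alignment is required. */
static uint16_t load16_unaligned(const void *ptr)
{
    uint16_t value;
    memcpy(&value, ptr, sizeof value);
    return value;
}

/* Pick the access based on the low bit of the address. */
uint16_t load16(const void *ptr)
{
    if (((uintptr_t)ptr & 1u) != 0)
        return load16_unaligned(ptr); /* odd address */
    return load16_aligned(ptr);       /* even address */
}
```

Whether the extra branch pays off depends on how much cheaper the aligned access really is on the target; if most pointers are of unknown alignment, always taking the unaligned path may be simpler.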

DRAM, say the kind used in your desktop/laptop, tends to be 64 or 72 (with ECC) bits wide, and every access to it is aligned, even though the memory sticks are actually made up of 8-, 16- or 32-bit wide chips. (This may be changing with phones/tablets for various reasons.) The memory controller and ideally at least one cache sit in front of this DRAM, so that unaligned accesses, and even aligned accesses that are smaller than the bus width and would need read-modify-writes, are dealt with in the cache SRAM, which is way faster, and the DRAM accesses are all aligned, full-bus-width accesses. If you have no cache in front of the DRAM and the controller is designed for full-width accesses, that is the worst performance. If it is designed for lighting up the byte lanes separately (assuming 8-bit wide chips), then you don't have the read-modify-writes, but you do have a more complicated controller. If the typical use case is with a cache (if there is one in the design), then it may not make sense to have that additional work in the controller for each byte lane; it can just know how to do transfers of the full bus width or multiples of it.

_mem2 is more general. It'll work whether ptr is aligned or not. _amem2 is more strict: it requires that ptr be aligned (though it is presumably slightly more efficient). So use _mem2 unless you can guarantee that ptr is always aligned.

Many processors have alignment restrictions on memory access. Unaligned access either generates an exception (e.g. on some ARM cores), or is just slower (e.g. x86).

_mem2 is probably implemented as fetching two bytes and using shift and bitwise OR operations to make a 16-bit ushort out of them.

_amem2 probably just reads the 16-bit ushort from the specified ptr.
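To illustrate the guess above, here is a minimal sketch of a two-byte fetch combined with shift and OR, written in portable C. It is not TI's implementation; `load16_le_bytewise` is a hypothetical name, and the little-endian byte order is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a 16-bit value from two single-byte loads, so the address
   may be odd or even. Assumes little-endian layout: the byte at the
   lower address is the least significant one. */
uint16_t load16_le_bytewise(const void *ptr)
{
    assert(ptr != NULL);
    const uint8_t *p = (const uint8_t *)ptr;
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}
```

Two loads plus a shift and an OR are more work than one 16-bit load, which is where the expected performance penalty of the unaligned variant would come from.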

I don't know the TMS320C64x specifically, but I'd guess it requires 16-bit alignment for 16-bit memory accesses. So you can always use _mem2, but with a performance penalty, and _amem2 when you can guarantee that ptr is an even address.

Licensed under: CC-BY-SA with attribution
