Understanding the implementation of memcpy()

Question 1

The code is testing whether the addresses are aligned suitably for a UINT. If so, the code copies using UINT objects. If not, the code copies using BYTE objects.

The test works by first performing a bitwise OR of the two addresses. Any bit that is on in either address will be on in the result. Then the test performs a bitwise AND with sizeof(UINT) - 1. It is expected the the size of a UINT is some power of two. Then the size minus one has all lower bits on. E.g., if the size is 4 or 8, then one less than that is, in binary 11₂ or 111₂. If either address is not a multiple of the size of a UINT, then it will have one of these bits on, and the test will indicate it. (Usually, the best alignment for an integer object is the same as its size. This is not necessarily true. A modern implementation of this code should use _Alignof(UINT) - 1 instead of the size.)

Copying with UINT objects is faster, because, at the hardware level, one load or store instruction loads or stores all the bytes of a UINT (likely four bytes). Processors will typically copy faster when using these instructions than when using four times as many single-byte load or store instructions.

This code is of course implementation dependent; it requires support from the C implementation that is not part of the base C standard, and it depends on specific features of the processor it executes on.

A more advanced memcpy implementation could contain additional features, such as:

If one of the addresses is aligned but the other is not, use special load-unaligned instructions to load multiple bytes from the one address, with regular store instructions to the other address.
If the processor has Single Instruction Multiple Data instructions, use those instructions to load or store many bytes (often 16, possibly more) in a single instruction.

Question 2

The code

((((ADDRESS) s) | ((ADDRESS) d) | c) & (sizeof(UINT) - 1))

Checks to see if either s, d, or c are not aligned to the size of a UINT.

For example, if s = 0x7ff30b14, d = 0x7ffa81d8, c = 256, and sizeof(UINT) == 4, then:

s         = 0b1111111111100110000101100010100
d         = 0b1111111111110101000000111011000
c         = 0b0000000000000000000000100000000
s | d | c = 0b1111111111110111000101111011100
(s | d | c) & 3 =                        0b00

So both pointers are aligned. It is easier to copy memory between pointers that are both aligned, and this does it with only one branch.

On many architectures, *(UINT *) ptr is much faster if ptr is correctly aligned to the width of a UINT. On some architectures, *(UINT *) ptr will actually crash if ptr is not correctly aligned.