Strange pointer arithmetic

Question 1

Your CPU architecture must support unaligned load and store operations.

To the best of my knowledge, it doesn't (and I've been using STM32, which is an ARM-based cortex).

If you try to read a uint32_t value from an address which is not divisible by the size of uint32_t (i.e. not divisible by 4), then in the "good" case you will just get the wrong output.

I'm not sure what's the address of your buffer, but at least one of the three uint32_t read attempts that you describe in your question, requires the processor to perform an unaligned load operation.

On STM32, you would get a memory-access violation (resulting in a hard-fault exception).

The data-sheet should provide a description of your processor's expected behavior.

UPDATE:

Even if your processor does support unaligned load and store operations, you should try to avoid using them, as it might affect the overall running time (in comparison with "normal" load and store operations).

So in either case, you should make sure that whenever you perform a memory access (read or write) operation of size N, the target address is divisible by N. For example:

uint08_t x = *(uint08_t*)y; // 'y' must point to a memory address divisible by 1
uint16_t x = *(uint16_t*)y; // 'y' must point to a memory address divisible by 2
uint32_t x = *(uint32_t*)y; // 'y' must point to a memory address divisible by 4
uint64_t x = *(uint64_t*)y; // 'y' must point to a memory address divisible by 8

In order to ensure this with your data structures, always define them so that every field x is located at an offset which is divisible by sizeof(x). For example:

struct
{
    uint16_t a; // offset 0, divisible by sizeof(uint16_t), which is 2
    uint08_t b; // offset 2, divisible by sizeof(uint08_t), which is 1
    uint08_t a; // offset 3, divisible by sizeof(uint08_t), which is 1
    uint32_t c; // offset 4, divisible by sizeof(uint32_t), which is 4
    uint64_t d; // offset 8, divisible by sizeof(uint64_t), which is 8
}

Please note, that this does not guarantee that your data-structure is "safe", and you still have to make sure that every myStruct_t* variable that you are using, is pointing to a memory address divisible by the size of the largest field (in the example above, 8).

SUMMARY:

There are two basic rules that you need to follow:

Every instance of your structure must be located at a memory address which is divisible by the size of the largest field in the structure.
Each field in your structure must be located at an offset (within the structure) which is divisible by the size of that field itself.

Exceptions:

Rule #1 may be violated if the CPU architecture supports unaligned load and store operations. Nevertheless, such operations are usually less efficient (requiring the compiler to add NOPs "in between"). Ideally, one should strive to follow rule #1 even if the compiler does support unaligned operations, and let the compiler know that the data is well aligned (using a dedicated #pragma), in order to allow the compiler to use aligned operations where possible.
Rule #2 may be violated if the compiler automatically generates the required padding. This, of course, changes the size of each instance of the structure. It is advisable to always use explicit padding (instead of relying on the current compiler, which may be replaced at some later point in time).

Question 2

LDR is the ARM instruction to load data. You have lied to the compiler that the pointer is a 32bit value. It is not aligned properly. You pay the price. Here is the LDR documentation,

If the address is not word-aligned, the loaded value is rotated right by 8 times the value of bits [1:0].

See: 4.2.1. LDR and STR, words and unsigned bytes, especially the section Address alignment for word transfers.

Basically your code is like,

  p = (uint32_t*)((char*)buffer + 0);
  p = (p>>16)|(p<<16);
  debug("%x ", *p);   // prints 0134fe2a

but has encoded to one instruction on the ARM. This behavior is dependent on the ARM CPU type and possibly co-processor values. It is also highly non-portable code.

Question 3

It's called "undefined behavior". Your code is casting a value which is not a valid unsigned long * into an unsigned long *. The semantics of that operation are undefined behavior, which means pretty much anything can happen*.

In this case, the reason two of your examples behaved as you expected is because you got lucky and buffer happened to be word-aligned. Your third example was not as lucky (if it was, the other two would not have been), so you ended up with a pointer with extra garbage in the 2 least significant bits. Depending on the version of ARM you are using, that could result in an unaligned read (which it appears is what you were hoping for), or it could result in an aligned read (using the most significant 30 bits) and a rotation (word rotated by the number of bytes indicated in the least significant 2 bits). It looks pretty clear that the later is what happened in your 3rd example.

Anyway, technically, all 3 of your example outputs are correct. It would also be correct for the program to crash on all 3 of them.

Basically, don't do that.

A safer alternative is to write the bytes into a uint32_t. Something like:

uint32_t w;
memcpy(&w, buffer, 4);
debug("%x ", w);
memcpy(&w, buffer+4, 4);
debug("%x ", w);
memcpy(&w, buffer+2, 4);
debug("%x ", w);

Of course, that's still assuming sizeof(uint32_t) == 4 && CHAR_BITS == 8, but that's a much safer assumption. (Ie, it should work on pretty much any machine with 8 bit bytes.)