Why 2 LSB's of 32 bit ARM instruction address not used

Question 1

I think you are again mixing up instruction access with data access. As far as data access is concerned, we may use the last two bits to select any single byte within a 4-byte word.

But the concept of not using the last two bits has nothing to do with accessing an individual byte of a 32-bit instruction. As you said, fetching an instruction one byte at a time is highly inefficient, and it is not permitted either. So to enforce this rule (an instruction may not start at an address that is not a multiple of 4), the last two bits are not considered. The following diagram explains this:

The addresses are 32 bit:

|--0x00000007--|--0x00000006--|--0x00000005--|--0x00000004--|

|--0x00000003--|--0x00000002--|--0x00000001--|--0x00000000--|

Focus on the last nibble:

| 3-0011; 2-0010; 1-0001; 0-0000; |

| 7-0111; 6-0110; 5-0101; 4-0100; |

Now focus on the two least significant bits. Our aim is not to allow an instruction to start at locations 1, 2, 3, 5, 6 or 7. So if you check the two LSBs, they cannot be 01, 10 or 11; only "00" is allowed. Since the addresses generated are always multiples of 4, these two bits are always 00, so ignoring them loses no information.
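The same check can be written as a bit mask. This is only a minimal sketch in C; the function names is_word_aligned and word_align are made up for illustration and do not come from any ARM header:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when addr could be the start of a 32-bit ARM instruction,
 * i.e. its two least significant bits are both zero. */
static bool is_word_aligned(uint32_t addr)
{
    return (addr & 0x3u) == 0;
}

/* Clear the two LSBs, which is the effect of ignoring them. */
static uint32_t word_align(uint32_t addr)
{
    return addr & ~UINT32_C(0x3);
}
```

With the addresses from the diagram, is_word_aligned() is true only for 0x00000000 and 0x00000004, and word_align(0x00000007) gives 0x00000004.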

Hope you can visualize better.

Question 2

Before thumb, all arm instructions were 32 bit (4 bytes), and let's dictate that they have to be aligned, so the lower two bits of an instruction address are always zero. Then thumb comes along with 16 bit instructions, so there only the lower bit of the address is always zero. They added a nuance: when using bx or blx to switch modes, the lsbit is used to distinguish between thumb and arm. If the lsbit is a zero when fed to bx or blx, then it stays in or switches to arm mode; if it is a one, it stays in or switches to thumb mode. Note that the lsbit is stripped off the address before it is placed in the pc; it is consumed. While running in either mode the pc lsbit is always zero, and bit one is also always zero in arm mode.
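As a rough sketch of that rule in C (a software model only, not how the hardware is built; the branch_target struct and decode_bx function are names invented for this example):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t pc;     /* address actually placed in the pc */
    bool     thumb;  /* true: thumb mode, false: arm mode */
} branch_target;

/* Model of what bx/blx does with the value in its register operand. */
static branch_target decode_bx(uint32_t reg_value)
{
    branch_target t;
    t.thumb = (reg_value & 1u) != 0;    /* lsbit selects the mode */
    t.pc = reg_value & ~UINT32_C(1);    /* lsbit is consumed, not kept */
    return t;
}
```

So decode_bx(0x8001) gives pc 0x8000 in thumb mode and decode_bx(0x8000) gives pc 0x8000 in arm mode. The sketch does not model what happens when bit one is set while branching to arm mode.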

Arm busses are typically 32 or 64 bits wide, and arm is not a variable length instruction set like x86, etc. (it is with thumb2 now, but that isn't quite the same). So you are not extracting individual bytes and then extracting more bytes to isolate instructions (not that a modern variable length instruction set does it that inefficiently). An arm may fetch something like 8 instructions at a time, which would be 4 clock cycles (once the handshakes are over) on a 64 bit data bus. That is with the cache off of course; with the cache on, the fetch is the same size or larger. Each core/architecture is different in its fetches, and the memory controller has to handle all the valid cycle types, from one byte on any lane on up to the full width of the bus.
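The "8 instructions in 4 clock cycles" figure is just arithmetic on the bus width. A small sketch in C (fetch_beats is a hypothetical helper, and it counts only data transfers, not the handshake cycles):

```c
#include <stdint.h>

/* Number of bus transfers needed to fetch `instructions` instructions
 * of `insn_bytes` bytes each over a bus that is `bus_bytes` wide. */
static uint32_t fetch_beats(uint32_t instructions, uint32_t insn_bytes,
                            uint32_t bus_bytes)
{
    uint32_t total = instructions * insn_bytes;
    return (total + bus_bytes - 1) / bus_bytes;  /* round up */
}
```

fetch_beats(8, 4, 8) is 4 for the 64 bit bus case above, and the same fetch on a 32 bit bus, fetch_beats(8, 4, 4), takes 8.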

I don't know what you mean by banks. As programmers we think in terms of byte based addresses, as a byte is our smallest addressable item. When you get to the actual rams, hardware folks start stripping off the address bits they are not using, so their lsbit may be different than ours. When you write a single byte, some processor busses won't put the whole byte address on the bus; they may only put the word or double word address on the bus (2 or 3 lsbits of zero) and then use a byte mask to tell which byte lanes contain new data and which byte lanes you have to preserve at the target.

The amba/axi bus cycles are described in the amba/axi bus documentation on arm's website, infocenter.arm.com, which describes in detail how each transaction works. Not very complicated at all...
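To make the byte mask idea concrete, here is a sketch in C of a single byte write on an imaginary 32 bit bus. It assumes lane 0 carries the lowest addressed byte, and the byte_write struct and make_byte_write function are invented for this example, not taken from the amba/axi documentation:

```c
#include <stdint.h>

typedef struct {
    uint32_t word_addr;  /* byte address with the 2 lsbits cleared */
    uint8_t  strobe;     /* bit n set: byte lane n carries new data */
    uint32_t data;       /* the byte, shifted onto its lane */
} byte_write;

/* Turn a byte address and value into what such a bus would carry. */
static byte_write make_byte_write(uint32_t byte_addr, uint8_t value)
{
    unsigned lane = byte_addr & 0x3u;  /* the 2 lsbits pick the lane */
    byte_write w;
    w.word_addr = byte_addr & ~UINT32_C(0x3);
    w.strobe = (uint8_t)(1u << lane);
    w.data = (uint32_t)value << (8u * lane);
    return w;
}
```

Writing 0xAB to byte address 0x1002 becomes word address 0x1000, byte mask 0x4 (only lane 2 is new data) and data 0x00AB0000; the target has to preserve lanes 0, 1 and 3.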

Question 3

Note that the question title is only true for a couple of specific architecture versions (ARMv3 and ARMv4, in 32-bit modes) - from ARMv4T, the LSB of branch addresses is used for ARM/Thumb interworking, as @dwelch has noted. On v6M and v7M an attempt to switch instruction sets is not ignored, and results in a fault.

Prior to v3 when the address space was only 26 bits and there was no dedicated CPSR, the bottom two bits of r15 were used to store the processor mode (with the flags in the top 6 bits) - a flag-setting write to r15 would update both the PC and the PSR bits.
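As an illustration of that layout, here is a sketch in C that splits a 26-bit-era r15 value into its parts. The r15_fields struct and split_r15 function are hypothetical names for this example; the flag order N, Z, C, V, I, F from bit 31 down to bit 26 is the layout of the 26-bit architecture:

```c
#include <stdint.h>

typedef struct {
    uint32_t pc;     /* word-aligned instruction address (bits 25:2) */
    unsigned flags;  /* top 6 bits of r15: N Z C V I F */
    unsigned mode;   /* bottom 2 bits of r15: processor mode */
} r15_fields;

/* Separate the PC from the PSR bits packed into the same register. */
static r15_fields split_r15(uint32_t r15)
{
    r15_fields f;
    f.pc = r15 & UINT32_C(0x03FFFFFC);
    f.flags = (r15 >> 26) & 0x3Fu;
    f.mode = r15 & 0x3u;
    return f;
}
```

For example, split_r15(0xF0008003) gives a PC of 0x00008000, flags 0x3C (N, Z, C and V set) and mode 3.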