ARM: Why do I need to push/pop two registers at function calls?

Question 1

what's the reason for the "dummy register" as they call it? Why not simply push{lr} and pop{pc}? They say it's to keep the stack 8-byte aligned, but ain't the stack 4-byte aligned?

~~The stack only requires 4-byte alignment; but~~ if the data bus is 64 bits wide (as it is on many modern ARMs), it's more efficient to keep it at an 8-byte alignment. Then, for example, if you call a function that needs to stack two registers, that can be done in a single 64-bit write rather than two 32-bit writes.

UPDATE: Apparently it's not just for efficiency; it's a requirement of the official procedure call standard, as noted in the comments.

If you're targeting older 32-bit ARMs, then the extra stacked register might degrade performance slightly.
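The arithmetic behind the dummy register can be sketched in C. This is only a model, assuming 4-byte registers and a full-descending stack; the helper names `is_8_byte_aligned` and `sp_after_push` and the address used below are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of one 32-bit ARM core register in bytes. */
#define REG_BYTES 4u

/* Returns true if addr is a multiple of 8. */
bool is_8_byte_aligned(uint32_t addr)
{
    return (addr % 8u) == 0u;
}

/* Models a push onto a full-descending stack: SP drops by 4 bytes
   for every register pushed. Hypothetical helper, not an ARM API. */
uint32_t sp_after_push(uint32_t sp, unsigned reg_count)
{
    assert(sp >= reg_count * REG_BYTES); /* no wrap-around in this model */
    return sp - reg_count * REG_BYTES;
}
```

Starting from an 8-byte aligned SP such as 0x20001000, `sp_after_push(sp, 1)` (the push{lr} case) gives 0x20000FFC, which is only 4-byte aligned, while `sp_after_push(sp, 2)` (the push{ip, lr} case) gives 0x20000FF8, which is still 8-byte aligned.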

what register is "ip" (i.e., r7 or what?)

r12. The procedure call standard (AAPCS) lists the full set of register aliases.

Question 2

8-byte alignment is a requirement for interoperability between objects conforming to the AAPCS.

ARM has an advisory note on this subject:

ABI for the ARM® Architecture Advisory Note – SP must be 8-byte aligned on entry to AAPCS-conforming functions

The note gives two reasons for using 8-byte alignment:

1. Alignment fault or UNPREDICTABLE behavior. (Hardware/architecture reasons: LDRD / STRD could cause an alignment fault or show UNPREDICTABLE behavior on architectures other than ARMv7.)
2. Application failure. (Differing compiler and runtime assumptions; the note gives va_start and va_arg as an example.)
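How a caller and a va_arg-style callee could come to disagree can be sketched in C. This is a deliberately simplified model, not the code any real compiler emits: it assumes the caller places a 64-bit argument at an SP-relative offset rounded up to 8, while the callee rounds the absolute address up to 8. The function names and addresses are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds value up to the next multiple of 8. */
uint32_t round_up_8(uint32_t value)
{
    assert(value <= UINT32_MAX - 7u); /* avoid overflow in this model */
    return (value + 7u) & ~UINT32_C(7);
}

/* Caller's view (assumed): the 64-bit argument goes at the first
   SP-relative offset that is a multiple of 8. */
uint32_t caller_slot(uint32_t sp, uint32_t offset)
{
    return sp + round_up_8(offset);
}

/* Callee's view (assumed): a va_arg-style fetch rounds the absolute
   address up to a multiple of 8. */
uint32_t callee_slot(uint32_t sp, uint32_t offset)
{
    return round_up_8(sp + offset);
}
```

With an 8-byte aligned SP of 0x20000FF8 and an offset of 4, both views land on 0x20001000. With an SP of 0x20000FFC the caller's view is 0x20001004 but the callee's is 0x20001000, so in this model the callee would read the wrong bytes.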

Of course, this is all about public interfaces; if you are building a static executable with no additional linking, you can align the stack to 4 bytes.

Question 3

Because you want to save them and restore them after your function has executed. On function entry (the prologue) it saves the ip and lr registers. When the function finishes (the epilogue) it restores both:

pc <- lr

ip <- old_ip
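The save and restore sequence above can be modelled with a toy stack in C. This is a sketch under stated assumptions: `ToyCpu` and its helpers are invented for illustration and only mimic what push {ip, lr} and pop {ip, pc} would do to three registers.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* A toy register file and full-descending stack; not real hardware state. */
typedef struct {
    uint32_t ip;                 /* r12 */
    uint32_t lr;                 /* r14 */
    uint32_t pc;                 /* r15 */
    uint32_t stack[STACK_WORDS];
    int sp;                      /* index of the most recently pushed word */
} ToyCpu;

/* Clears the registers and marks the stack as empty. */
void toy_init(ToyCpu *cpu)
{
    *cpu = (ToyCpu){0};
    cpu->sp = STACK_WORDS;
}

/* Models push {ip, lr}: the lower-numbered register lands at the
   lower address, and SP moves down by two words. */
void toy_prologue(ToyCpu *cpu)
{
    assert(cpu->sp >= 2);
    cpu->sp -= 2;
    cpu->stack[cpu->sp] = cpu->ip;
    cpu->stack[cpu->sp + 1] = cpu->lr;
}

/* Models pop {ip, pc}: ip gets its old value back and the saved lr
   is loaded straight into pc, which performs the return. */
void toy_epilogue(ToyCpu *cpu)
{
    assert(cpu->sp <= STACK_WORDS - 2);
    cpu->ip = cpu->stack[cpu->sp];
    cpu->pc = cpu->stack[cpu->sp + 1];
    cpu->sp += 2;
}
```

If the function body overwrites ip and lr between `toy_prologue` and `toy_epilogue`, ip still ends up with its old value and pc with the original return address, and two words (8 bytes) were stacked in between.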

EDIT

Register r12 is also referred to as IP and is used as an intra-procedure-call scratch register.

The convention is that the callee function may change ip and r0-r3, so whether you must save and restore them depends on the calling convention.

EDIT2: Why we might want the stack to be 8-byte aligned on ARM

If the stack is not eight-byte aligned the use of LDRD and STRD (load and store doubleword) might cause an alignment fault, depending on the target and configuration used.

Note that x86 has the same issue, and on Mac OS the stack must be 16-byte aligned.