What's the reason for the "dummy register", as they call it? Why not simply push {lr} and pop {pc}? They say it's to keep the stack 8-byte aligned, but isn't the stack 4-byte aligned?
The hardware only requires the stack to be 4-byte aligned, but if the data bus is 64 bits wide (as it is on many modern ARMs), it's more efficient to keep it 8-byte aligned. Then, for example, if you call a function that needs to stack two registers, that can be done in a single 64-bit write rather than two 32-bit writes.
UPDATE: Apparently it's not just for efficiency; it's a requirement of the official procedure call standard (the AAPCS requires the stack pointer to be 8-byte aligned at any public interface), as noted in the comments.
If you're targeting older 32-bit ARMs, then the extra stacked register might degrade performance slightly.
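To make the alignment arithmetic concrete, here is a minimal sketch in C rather than assembly. It assumes a full-descending stack of 32-bit registers and an arbitrary 8-byte-aligned starting address; the helper names `sp_after_push`, `is_8_byte_aligned`, and `alignment_demo` are made up for illustration, and `push {ip, lr}` is assumed to be the "dummy register" form the question refers to.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the stack pointer after pushing `nregs` 32-bit
   registers onto a full-descending stack (sp drops 4 bytes per register). */
static uint32_t sp_after_push(uint32_t sp, unsigned nregs)
{
    return sp - 4u * nregs;
}

/* Non-zero if sp is a multiple of 8. */
static int is_8_byte_aligned(uint32_t sp)
{
    return (sp & 7u) == 0;
}

/* Worked cases, starting from an arbitrary 8-byte-aligned address. */
static void alignment_demo(void)
{
    const uint32_t sp = 0x20001000u;

    assert(is_8_byte_aligned(sp));
    /* push {lr}: one register, sp ends up only 4-byte aligned. */
    assert(!is_8_byte_aligned(sp_after_push(sp, 1)));
    /* push {ip, lr}: the extra register restores 8-byte alignment. */
    assert(is_8_byte_aligned(sp_after_push(sp, 2)));
}
```

Calling `alignment_demo()` from a `main` function should pass all three assertions: pushing an odd number of registers leaves the stack pointer on a 4-byte boundary only, which is why an otherwise unneeded register gets stacked alongside `lr`.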
What register is "ip" (r7, or something else)?
r12. See, for example, here for the full set of register aliases used by the procedure call standard.
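For quick reference, the primary aliases can be written as a small lookup table. This is a sketch based on the AAPCS core register names as I understand them; the function name `aapcs_alias` is hypothetical, and some registers have further synonyms (r9 is also called sb, for instance) that are left out here.

```c
#include <stddef.h>

/* Hypothetical helper: primary procedure-call-standard alias for core
   register r0..r15, or NULL if the register number is out of range. */
static const char *aapcs_alias(unsigned reg)
{
    static const char *const names[16] = {
        "a1", "a2", "a3", "a4",             /* r0-r3: arguments, results, scratch */
        "v1", "v2", "v3", "v4",             /* r4-r7: variable registers */
        "v5", "v6", "v7", "v8",             /* r8-r11: variable registers */
        "ip",                               /* r12: intra-procedure-call scratch */
        "sp",                               /* r13: stack pointer */
        "lr",                               /* r14: link register */
        "pc"                                /* r15: program counter */
    };
    return reg < 16 ? names[reg] : NULL;
}
```

So `aapcs_alias(12)` gives `"ip"`. Since ip is a scratch register that a callee is not required to preserve, stacking it as padding costs nothing semantically.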