Hans Passant's answer is correct. A push or pop can be broken down into two micro-ops: a memory move and an increment or decrement of the stack pointer. If the stack pointer (or any register used as a pointer) is updated and then used by the very next instruction, that instruction has to wait for the update, which generally causes a stall. Addressing the individual stack slots relative to an unchanged stack pointer, as in your example, avoids that dependency, so the accesses do not stall and can be paired, that is, executed in the same cycle.
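To make the two-micro-op view concrete, here is a minimal sketch in Python. It is not real machine code. It assumes a 32-bit stack with 4-byte slots and an arbitrary starting stack pointer of 64, and the names `run_pushes` and `run_movs` are made up for illustration. It only shows that a run of pushes leaves memory and the stack pointer in the same state as one up-front adjustment followed by stores at fixed offsets.

```python
def run_pushes(values, esp=64):
    """Model each push as two micro-ops: adjust esp, then store through it."""
    mem = {}
    for v in values:
        esp -= 4        # micro-op 1: decrement the stack pointer
        mem[esp] = v    # micro-op 2: store through the *updated* pointer
    return esp, mem


def run_movs(values, esp=64):
    """Model one stack-pointer adjustment followed by independent stores."""
    mem = {}
    esp -= 4 * len(values)              # single adjustment up front
    for k, v in enumerate(reversed(values)):
        mem[esp + 4 * k] = v            # every store uses the same, unchanged esp
    return esp, mem


if __name__ == "__main__":
    # Both sequences end with the same stack pointer and memory contents.
    print(run_pushes([1, 2, 3]) == run_movs([1, 2, 3]))
```

In the first function every store depends on the decrement just before it. In the second, no store depends on any other store, which is what lets the hardware overlap them.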
Any superscalar CPU will try to execute multiple instructions in a single cycle when neither depends on the other's results or sources. The compiler is doing something for you here, to speed up execution, that would be fairly laborious to do by hand. The replacement instructions may occupy more code space than the pushes, but all other things being equal they can execute roughly twice as fast on a core that pairs two of them per cycle.
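The "roughly twice as fast" figure can be illustrated with a toy issue model. This is a sketch under strong assumptions, not a description of any real CPU. It assumes in-order issue, at most two instructions per cycle, and that an instruction cannot share a cycle with an earlier one that writes a register it reads or writes. It also assumes the example in question replaces pushes with one `sub esp, N` followed by `mov` stores. The register names and the helpers `push`, `sub_esp`, `mov_store` and `count_cycles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    reads: frozenset
    writes: frozenset


def push(reg):
    # A push both reads and writes esp, so consecutive pushes form a chain.
    return Instr(f"push {reg}", frozenset({"esp", reg}), frozenset({"esp"}))


def sub_esp(n):
    return Instr(f"sub esp, {n}", frozenset({"esp"}), frozenset({"esp"}))


def mov_store(offset, reg):
    # A store through esp reads esp and the source register; it writes only memory.
    return Instr(f"mov [esp+{offset}], {reg}", frozenset({"esp", reg}), frozenset())


def count_cycles(program, width=2):
    """Count cycles for in-order issue of up to `width` instructions per cycle."""
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        written = set()   # registers written by instructions issued this cycle
        issued = 0
        while i < len(program) and issued < width:
            ins = program[i]
            if (ins.reads | ins.writes) & written:
                break     # depends on an instruction in this cycle: wait
            written |= ins.writes
            issued += 1
            i += 1
    return cycles


regs = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7"]  # placeholder register names
pushes = [push(r) for r in regs]
movs = [sub_esp(4 * len(regs))] + [mov_store(4 * k, r) for k, r in enumerate(reversed(regs))]

if __name__ == "__main__":
    print("pushes:", count_cycles(pushes), "cycles")
    print("sub + movs:", count_cycles(movs), "cycles")
```

In this model the eight pushes issue one per cycle because each waits on the previous stack-pointer update, while the stores pair up after the single `sub`, so the gap approaches a factor of two as the sequence gets longer. Real hardware differs in many details, so treat the cycle counts as illustrative only.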