Cost of swapping variables through mov, xor

Question 1

Loading both values into two registers and then writing them back swapped is likely the fastest of the solutions: four memory cycles, four instructions, two registers. Assuming the data has to start in and return to RAM, you can't beat this approach in general.
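As a rough sketch of the load, load, store, store sequence described above, here it is in C. The function name `swap_via_registers` and the choice of `uint32_t` are assumptions for illustration; whether the compiler really emits exactly two loads and two stores depends on the target and optimization level.

```c
#include <stdint.h>

/* Hypothetical helper: swap two values that live in memory.
 * The two locals stand in for the two registers. */
void swap_via_registers(uint32_t *a, uint32_t *b)
{
    uint32_t r0 = *a;  /* memory cycle 1: load a  */
    uint32_t r1 = *b;  /* memory cycle 2: load b  */
    *a = r1;           /* memory cycle 3: store a */
    *b = r0;           /* memory cycle 4: store b */
}
```

Called as `swap_via_registers(&x, &y)`, this leaves `x` holding the old `y` and vice versa.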

Four xors, assuming you could use memory for both sources and destinations, cost three memory cycles per xor (two reads and one write), so 12 memory cycles in total. That is a definite loser. Using registers to avoid two memory operands just adds more instructions.
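To make the per-xor cost concrete, here is a minimal sketch of an XOR swap done directly on memory. It uses the classic three-XOR form, which may differ from the code in the question, so it comes to nine memory accesses rather than twelve. The name `swap_xor_in_memory` is hypothetical, and `volatile` is used only to keep the compiler from folding the accesses away. The early return is there because an XOR swap of an object with itself would zero it.

```c
#include <stdint.h>

/* Hypothetical helper: XOR swap with memory operands only. */
void swap_xor_in_memory(volatile uint32_t *a, volatile uint32_t *b)
{
    if (a == b) {
        return;   /* XOR-swapping an object with itself would clear it */
    }
    *a ^= *b;     /* read *a, read *b, write *a: three memory accesses */
    *b ^= *a;     /* read *b, read *a, write *b: three more            */
    *a ^= *b;     /* read *a, read *b, write *a: three more            */
}
```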

Your asm pseudocode is 6 memory cycles, 6 instructions, one register. The four-cycle, four-instruction, two-register approach is likely cheaper. If you have to spend two memory cycles to free up those two registers, it becomes 6 cycles, whereas your version would need one additional cycle to free up its single register, so 7. 6 is still cheaper than 7, and 5 instructions is cheaper than 7. Instruction size was not counted here; it adds to memory cycles, although fetching is likely done efficiently (in good-sized aligned chunks).

If the data were already in registers, then using a third register and doing the three-instruction sequence tmp = a, a = b, b = tmp is three operations, three registers, and the fastest. But if you simply can't spare a register, then four xors is faster than spilling one to memory.
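The two register-only variants can be sketched in C as follows. The type `pair_t` and both function names are made up for this example, and the XOR version is the classic three-XOR form rather than necessarily the one from the question. Passing and returning a small struct by value is just a way to give the compiler a chance to keep both values in registers; it is not guaranteed to do so.

```c
#include <stdint.h>

/* Hypothetical pair type, small enough to live in registers. */
typedef struct {
    uint32_t a;
    uint32_t b;
} pair_t;

/* Swap using a third value (needs one spare register). */
pair_t swap_with_tmp(pair_t p)
{
    uint32_t tmp = p.a;
    p.a = p.b;
    p.b = tmp;
    return p;
}

/* Swap using XOR only (no spare register needed). */
pair_t swap_with_xor(pair_t p)
{
    p.a ^= p.b;   /* a = a0 ^ b0                  */
    p.b ^= p.a;   /* b = b0 ^ (a0 ^ b0) = a0      */
    p.a ^= p.b;   /* a = (a0 ^ b0) ^ a0 = b0      */
    return p;
}
```

Both functions return the same result for any input; the difference is only in how many registers and dependent operations they need.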

That's all a generic, high-level view. There are likely processors, cache situations, and so on where the solution that appears faster does not end up being faster, certainly for a single test but perhaps in general too, depending on the situation.

Question 2

There is no reason why the XOR method would be faster on any machine.

Both methods need to perform two reads and two writes, and the XOR method adds ALU and memory overhead on top of that.

Question 3

On processors supporting register move elimination (e.g. Ivy Bridge or later generations), the fastest way should be the first (using a temp variable), provided you can make the compiler keep these values in registers (you'll have to check the generated assembly to make sure).

This way you avoid not only the memory accesses (a read-after-write should get forwarded internally, but you still accumulate latencies in the memory unit) but also the execution latency. The CPU would simply switch the pointers of the registers themselves in the out-of-order register renamer.

Even without move elimination, register-only moves should be faster. The memory unit has tons of restrictions it has to enforce (collision checks, cache lookups, etc.), a longer pipeline, and less bandwidth than regular execution.
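To check whether a temp-variable swap stays in registers, a small test function like the one below can be compiled to assembly and inspected. The function `ordered_difference` is purely a made-up example; the compiler is free to turn the swap into register moves, a conditional move, or to remove it altogether, so the output has to be read rather than assumed.

```c
#include <stdint.h>

/* Hypothetical example: swap two locals with a temp so that lo <= hi,
 * then return the distance between them. No pointers are involved,
 * so nothing forces the values out to memory. */
uint32_t ordered_difference(uint32_t lo, uint32_t hi)
{
    if (lo > hi) {
        uint32_t tmp = lo;  /* temp-variable swap on locals */
        lo = hi;
        hi = tmp;
    }
    return hi - lo;
}
```

With GCC or Clang, something like `cc -O2 -S swap.c` (the file name is an assumption) writes the generated assembly to `swap.s`, where you can look for loads and stores around the swap.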