What's the difference between the x86 NOP and FNOP instructions?

https://stackoverflow.com//questions/25008772

20-12-2019

Question

I was reading the Intel instruction manual and noticed there is a 'NOP' instruction that does nothing on the main CPU, and a 'FNOP' instruction that does nothing on the FPU. Why are there two separate instructions to do nothing?

The only thing different I saw was they throw different exceptions, so you might watch for an exception from FNOP to detect whether there's an FPU available. But aren't there other mechanisms like CPUID to detect this? What practical reason is there to have two separate NOP instructions?

Solution

Expanding on Raymond Chen and Hans Passant's comments, there are historical reasons for there being two separate instructions and why they don't quite have the same effect.

Neither of the two instructions, NOP and FNOP, was originally designed as an explicit no-operation instruction. The NOP instruction is actually just an alias for the instruction XCHG AX,AX (or, in 32-bit mode, XCHG EAX,EAX). On early Intel processors it didn't actually do nothing. While it had no externally visible effect, internally it was executed just like any other XCHG instruction and took just as many cycles. The '486 was the first Intel CPU to treat it specially: it could execute a NOP in 1 cycle, while any other register-to-register XCHG instruction took 3 cycles.

Treating the XCHG AX,AX instruction specially becomes very important in modern Intel processors. If it were still actually exchanging the register with itself, it could introduce pipeline stalls whenever a nearby instruction also used the AX register. Because it is treated specially, the CPU doesn't end up thinking the NOP needs to wait for a previous instruction that sets AX, or that a following instruction needs to wait for the NOP.

This brings up the fact that there are lots of different instructions that do nothing, though XCHG AX,AX is the only one that's a single byte (as a special case of the exchange-register-with-accumulator single-byte XCHG encodings). Often these instructions are used as a single-instruction substitute for consecutive NOP instructions, like when aligning the start of a loop for performance reasons. For example, if you wanted a 6-byte NOP you could use LEA EAX,[EAX + 00000000]. Intel eventually added an explicit multiple-byte NOP instruction. (Well, not so much added as officially documented an instruction that had been there since the Pentium Pro.) However, only the single-byte form is treated specially; the multiple-byte NOPs will generate stalls if nearby instructions use the same registers.
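
To make the padding idea concrete, here is a minimal sketch in C of how an assembler or code generator might fill an alignment gap with as few NOP instructions as possible. The byte sequences are the recommended multi-byte NOP forms (opcode 0F 1F /0) as listed in Intel's instruction reference; the names `nop_table` and `emit_nop_padding` are hypothetical and not taken from any real tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Recommended NOP encodings, indexed by (length - 1).
   Operand notation in the comments is for 32-bit mode. */
static const uint8_t nop_table[9][9] = {
    {0x90},                                                 /* NOP (XCHG EAX,EAX) */
    {0x66, 0x90},                                           /* 66 NOP */
    {0x0F, 0x1F, 0x00},                                     /* NOP DWORD PTR [EAX] */
    {0x0F, 0x1F, 0x40, 0x00},                               /* NOP DWORD PTR [EAX+00H] */
    {0x0F, 0x1F, 0x44, 0x00, 0x00},                         /* NOP DWORD PTR [EAX+EAX*1+00H] */
    {0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00},                   /* 66 NOP DWORD PTR [EAX+EAX*1+00H] */
    {0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00},             /* NOP DWORD PTR [EAX+00000000H] */
    {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00},       /* NOP DWORD PTR [EAX+EAX*1+00000000H] */
    {0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00}, /* 66 NOP DWORD PTR [EAX+EAX*1+00000000H] */
};

/* Hypothetical helper: fill buf[0..n) with NOPs, using the longest
   available encoding first. Returns the number of instructions emitted. */
size_t emit_nop_padding(uint8_t *buf, size_t n)
{
    size_t count = 0;

    assert(buf != NULL || n == 0);
    while (n > 0) {
        size_t len = n < 9 ? n : 9;          /* longest encoding that still fits */
        memcpy(buf, nop_table[len - 1], len);
        buf += len;
        n -= len;
        count++;
    }
    return count;
}
```

Asking this helper for 6 bytes produces the single instruction 66 0F 1F 44 00 00 rather than six separate 0x90 bytes. Code that must run on processors older than the Pentium Pro would have to fall back to forms such as the LEA mentioned above, since the 0F 1F encodings are not available there.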

When AMD added 64-bit support to their CPUs they went even further. NOP is no longer the equivalent of XCHG EAX,EAX in 64-bit mode. One of the problems with the Intel instruction set is that there are a lot of instructions that modify only part of a register. For example, MOV BX,AX only modifies the lower 16 bits of EBX, leaving the upper 16 bits unmodified. These partial modifications make it hard for the CPU to avoid stalls, so AMD decided to prevent that when using 32-bit instructions in 64-bit mode. Whenever the result of a 32-bit operation is stored in a (64-bit) register, the value is zero-extended to 64 bits so that the entire register is modified. This means XCHG EAX,EAX is no longer a NOP, as it clears the upper 32 bits of RAX (and thus if you explicitly write XCHG EAX,EAX, it can't assemble to 0x90 and has to use the 87 C0 encoding). In 64-bit mode NOP is now an explicit NOP with no other interpretation.
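
The register-write rules described above can be modeled in a few lines of C. This is only an illustrative sketch of the architectural behavior, not code that touches real registers; the function names `write16`, `write32`, `xchg_eax_eax` and `nop` are made up for this example.

```c
#include <stdint.h>

/* 16-bit destination (e.g. MOV BX,AX): only the low 16 bits change,
   the rest of the 64-bit register is preserved. */
uint64_t write16(uint64_t reg, uint16_t value)
{
    return (reg & ~UINT64_C(0xFFFF)) | value;
}

/* 32-bit destination in 64-bit mode: the result is zero-extended,
   so the old upper 32 bits are discarded. */
uint64_t write32(uint64_t reg, uint32_t value)
{
    (void)reg;                 /* the previous contents do not matter */
    return (uint64_t)value;
}

/* XCHG EAX,EAX encoded as 87 C0: a genuine 32-bit write of EAX to itself. */
uint64_t xchg_eax_eax(uint64_t rax)
{
    return write32(rax, (uint32_t)rax);
}

/* NOP encoded as 90 in 64-bit mode: RAX is left untouched. */
uint64_t nop(uint64_t rax)
{
    return rax;
}
```

In this model, `xchg_eax_eax(0xFFFFFFFFFFFFFFFF)` yields `0x00000000FFFFFFFF`, while `nop` returns its argument unchanged, which is why the two can no longer share the 0x90 encoding.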

As for the FNOP instruction, it's not entirely clear how the original 8087 treated it, but I'm pretty sure it wasn't handled as an explicit no-operation either. At least one old Intel manual, the ASM86 Language Reference Manual, does document it as doing something with no effect ("stores the stack top to the stack top"). From its position in the opcode map it looks like it might be an alias for either FST ST or FLD ST, both of which would copy the top of the stack to the top of the stack. However, it did get some special treatment: it executed in an average of 13 cycles instead of the average 18 or 20 cycles for a stack-to-stack FST or FLD instruction respectively. If it were being treated as a no-operation instruction I'd expect it to be even faster, as there are a number of 8087 instructions that can execute in half the time.

More importantly the FNOP instruction behaves differently than NOP because of how FPU instructions used to be implemented on Intel processors. The CPU itself didn't support floating-point arithmetic, instead these duties were offloaded onto an optional floating-point coprocessor, originally the 8087. One of the nice things about the coprocessor was that it executed instructions in parallel with the CPU. However this means that the CPU sometimes needs to wait for the FPU to finish an operation. The CPU automatically waits for it to finish executing the previous instruction before giving it another instruction, but a program would need to explicitly wait (using a WAIT instruction) before it could read a result that the coprocessor wrote to memory.

Because the coprocessor worked in parallel this also meant that if an FPU instruction generated a floating-point exception, by the time it detected this the CPU would already have moved on to execute the next instruction. Normally when an instruction generates an exception on the CPU, it's handled while that instruction is still being executed, but when an FPU instruction generates an exception the CPU has already completed executing that instruction by handing it off to the FPU. Instead of interrupting the CPU and delivering the floating-point exception asynchronously, the CPU is only notified when it waits for the coprocessor, either explicitly or implicitly.

In modern processors the FPU is no longer a coprocessor; it's an integral part of the CPU. This means programs no longer have to wait for the FPU to write values to memory. However, how FPU exceptions are handled hasn't changed. (It turns out that delivering exceptions immediately is difficult to implement on modern CPUs, so they took advantage of the one case where they didn't have to.) So if a previous FPU instruction generated an undelivered floating-point exception, a NOP leaves the exception undelivered, while FNOP, because it's an FPU instruction, will do an implicit "wait" that results in the floating-point exception being delivered.

This example demonstrates the difference:

FLD1       ; push 1.0 onto the FPU stack
FLDZ       ; push 0.0
FDIV       ; divide 1.0 by 0.0
NOP        ; does nothing
NOP        ; does nothing
FNOP       ; delivers the pending FP zero-divide exception (if unmasked) and then does nothing
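
The sequence above can also be traced with a toy model in C. This is a deliberately simplified sketch, assuming the zero-divide exception is unmasked and that every FPU or wait instruction checks for a pending exception before it executes; the type `op_t` and the function `first_fault_index` are hypothetical names invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    OP_NOP,            /* integer NOP: never looks at the FPU */
    OP_FNOP,           /* FPU no-op: checks for a pending exception first */
    OP_FWAIT,          /* explicit wait: checks for a pending exception */
    OP_FLD,            /* any well-behaved FPU instruction */
    OP_FDIV_BY_ZERO    /* FPU instruction that raises an unmasked zero-divide */
} op_t;

/* Returns the index of the instruction at which a pending FP exception
   is delivered, or -1 if the program ends with it still undelivered. */
int first_fault_index(const op_t *prog, size_t n)
{
    bool pending = false;

    for (size_t i = 0; i < n; i++) {
        bool checks_fpu = (prog[i] != OP_NOP);

        if (checks_fpu && pending)
            return (int)i;       /* delivered before this instruction runs */
        if (prog[i] == OP_FDIV_BY_ZERO)
            pending = true;      /* recorded by the FPU, not yet delivered */
    }
    return -1;
}
```

Feeding it the six instructions of the example (two loads, the division, two NOPs, then FNOP) returns index 5, the FNOP. Dropping the FNOP returns -1, because nothing ever prompts the CPU to look at the FPU again.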

Licensed under: CC-BY-SA with attribution
