Is there a cheaper serializing instruction than cpuid?

Question 1

Have you looked at the rdtscp instruction? This is the read serialized version of rdtsc.

For benchmarking, I recommend reading this whitepaper, which describes several best practices for measuring clock ticks.

Alex (Intel)

Question 2

For ordering rdtsc wrt. other instructions, lfence is sufficient if you don't need to wait for the store buffer to drain. That has always been true on Intel, and has been true on AMD since the Spectre mitigations. See "Solution to rdtsc out of order execution?"
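As a minimal sketch of the lfence approach, assuming GCC or Clang on x86-64 and the intrinsics from x86intrin.h: the names tsc_fenced, time_ticks, and spin_work are hypothetical helpers for illustration, not anything from the linked Q&A.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, _mm_lfence (GCC/Clang) */

/* Read the TSC with an lfence on each side: the first waits for earlier
   instructions to finish executing, the second keeps later instructions
   from starting before the TSC has been read. Stores may still be
   sitting in the store buffer; lfence does not wait for them. */
static inline uint64_t tsc_fenced(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Hypothetical workload, only here so there is something to time. */
static void spin_work(void)
{
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 1000; i++)
        sink += i;
}

/* Time one call of fn, in TSC reference ticks (not core clock cycles). */
static uint64_t time_ticks(void (*fn)(void))
{
    uint64_t start = tsc_fenced();
    fn();
    uint64_t end = tsc_fenced();
    return end - start;
}
```

Calling time_ticks(spin_work) returns the elapsed tick count for one run; for real measurements you would repeat it many times and look at the minimum or median.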

rdtscp is also guaranteed to be ordered wrt. earlier instructions (but not later; in practice it's probably microcoded pretty much like lfence;rdtsc in that order, plus uops to write ECX with the processor ID.) It's not an x86 serializing instruction, and doesn't even drain the store buffer. (Which you wouldn't necessarily want for timing anyway.) You can mfence; rdtscp or lock or byte [rsp], 0 ; rdtscp if you want that, or rdtscp; lfence if you want to make sure its few uops can't reorder with later stuff.
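The rdtscp variants mentioned above might look like this in C. This is a sketch assuming GCC or Clang on x86-64 and a CPU that supports rdtscp; the function names are made up for illustration.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtscp, _mm_lfence, _mm_mfence */

/* rdtscp; lfence
   rdtscp itself waits for earlier instructions to execute; the trailing
   lfence keeps later instructions from starting before the TSC read. */
static inline uint64_t tsc_rdtscp_fenced(unsigned *aux_out)
{
    unsigned aux;                  /* receives ECX, i.e. IA32_TSC_AUX */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    if (aux_out)
        *aux_out = aux;
    return t;
}

/* mfence; rdtscp
   Use this form only if you also want earlier stores to drain from the
   store buffer before the timestamp is taken. */
static inline uint64_t tsc_rdtscp_after_stores(unsigned *aux_out)
{
    unsigned aux;
    _mm_mfence();
    uint64_t t = __rdtscp(&aux);
    if (aux_out)
        *aux_out = aux;
    return t;
}
```

The aux value is whatever the OS put in IA32_TSC_AUX, typically a CPU or node ID, so comparing it across the start and end of a measurement is one way to notice that the thread migrated.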

See also this Q&A for more about the TSC in general, including that it counts at a fixed frequency rather than in core clock cycles.

True serializing instructions

To answer the title question about "serializing instructions" in the x86 technical terminology sense, Alder Lake (and Sapphire Rapids) and later have serialize, which does exactly that and no more.
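A hedged sketch of using serialize when it is available and falling back to cpuid otherwise, assuming GCC or Clang on x86-64. The feature bit is CPUID.(EAX=7,ECX=0):EDX[14]; the function names has_serialize and serialize_execution are hypothetical. The instruction is emitted as raw bytes (0F 01 E8) so the example does not depend on assembler support or -mserialize.

```c
#include <cpuid.h>   /* __get_cpuid_count (GCC/Clang) */

/* Returns 1 if the CPU reports SERIALIZE support, else 0. */
static int has_serialize(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 7 not available */
    return (edx >> 14) & 1;     /* CPUID.(EAX=7,ECX=0):EDX bit 14 */
}

/* Run one serializing instruction: serialize if supported, else cpuid. */
static void serialize_execution(int have_serialize)
{
    if (have_serialize) {
        /* serialize, encoded as 0F 01 E8; clobbers no registers */
        __asm__ volatile(".byte 0x0f, 0x01, 0xe8" ::: "memory");
    } else {
        /* cpuid overwrites EAX, EBX, ECX, and EDX; results are ignored.
           volatile keeps the compiler from deleting the unused asm. */
        unsigned eax = 0, ebx, ecx = 0, edx;
        __asm__ volatile("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
        (void)ebx;
        (void)edx;
    }
}
```

You would typically call has_serialize() once at startup and cache the result, since the detection itself uses cpuid and is therefore slow in a VM.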

lfence serializes instruction execution (drains the ROB but not store buffer): See

In a VM, cpuid is a guaranteed vmexit, so it's slow. It could possibly be faster to push SS, RSP, RFLAGS, CS, and RIP, and run an iret instruction. (In 64-bit mode, iret pops RIP, CS, RFLAGS, RSP, and SS, in that order.)

When you need a true serializing instruction

Cross-modifying code is a case where a proper serializing instruction can matter vs. something like mfence;lfence. After an acquire load sees a release store indicating that the new code is there, you need to run a serializing instruction. Intel's Volume 3 manual, section 8.1.3, guarantees that's sufficient for cross-modifying code to be safe.

I assume that makes sure old code hasn't already been fetched by the front-end. So a serializing instruction might fully nuke the pipeline, or do the equivalent if there's enough tracking of recently fetched instructions in the pipeline to snoop on L1i invalidations. (That extra snooping might not be worth the power since serializing instructions are hopefully rare. The tracking is needed anyway to handle self-modifying code, snooping store addresses for being near any instruction in flight.)

mfence (or a lock or byte [rsp],0) + lfence wouldn't necessarily be strong enough since lfence only drains the ROB, concerned with instruction execution not fetch, and mfence deals with data load/store. cpuid is a good bet for this case if you can't use serialize.
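The reader-side sequence for cross-modifying code (acquire load, then a serializing instruction) could be sketched like this with C11 atomics, assuming GCC or Clang on x86-64. The flag code_ready and the function names are hypothetical, and the actual patching of instruction bytes is left out.

```c
#include <stdatomic.h>

/* Hypothetical flag: the writer sets it only after the new code bytes
   have been stored. */
static atomic_int code_ready;

/* cpuid as the serializing instruction (use serialize where available).
   volatile and the "memory" clobber keep the compiler from dropping or
   moving it. */
static inline void serialize_cpuid(void)
{
    unsigned eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx;
    (void)edx;
}

/* Writer side: call after the new instruction bytes have been written. */
static void publish_new_code(void)
{
    atomic_store_explicit(&code_ready, 1, memory_order_release);
}

/* Reader side: returns once it should be safe to jump to the new code. */
static void wait_for_new_code(void)
{
    while (!atomic_load_explicit(&code_ready, memory_order_acquire))
        ;                   /* spin; real code would pause or block */
    serialize_cpuid();      /* discard anything fetched before the flag */
}
```

The important part is the order on the reader: the serializing instruction runs after the acquire load observes the flag and before control transfers to the modified code.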

(Even an atomic RMW or atomic store within an aligned 8-byte chunk in the writer isn't sufficient. On some microarchitectures, I think unaligned code-fetch of 16-byte chunks from L1i cache is possible, so the reader might tear at any boundary.)

Question 3

The answer is apparently not. The Intel Manual, Volume 3a lists only 3 non-privileged serializing instructions (cpuid, iret, and rsm), and the latter two seem to have control-flow side-effects.

Question 4

Well, I guess this is helpful: lfence. See the "64-ia-32-architectures-software-developer-manual", Vol. 2B, page 4-301.